annotate src/lib/unichar.h @ 7185:6f014a866f38 HEAD

Replace invalid UTF8 input with a replacement character.
author Timo Sirainen <tss@iki.fi>
date Tue, 22 Jan 2008 09:31:59 +0200
parents dcbf6afdf931
children 81806d402514
changesets referenced (all by Timo Sirainen <tss@iki.fi>):
rev   node          parents  description
6410  e4eb71ae8e96  6129     Changed .h ifdef/defines to use <NAME>_H format.
4899  c98008a7e9b7  -        Added unichar_t UCS-4 type and some ucs4/utf8 functions.
7185  6f014a866f38  7042     Replace invalid UTF8 input with a replacement character.
7042  dcbf6afdf931  6952     Define unichars array type and use it for uni_utf8_to_ucs4() output.
5683  8101787cdd1c  4899     Rewrote some code and cleaned up the API
6129  04b9eb27283c  5683     Added uni_ucs4_to_titlecase() and uni_utf8_to_decomposed_titlecase(). They
6952  08e4d7efcd6a  6951     uni_utf8_get_valid_data() API changed.
6951  1f70c72e4312  6410     Moved uni_utf8_get_valid_data() to lib/

rev   line source
6410   1 #ifndef UNICHAR_H
6410   2 #define UNICHAR_H
4899   3
7185   4 /* Character used to replace invalid input. */
7185   5 #define UNICODE_REPLACEMENT_CHAR 0xfffd
7185   6
4899   7 typedef uint32_t unichar_t;
7042   8 ARRAY_DEFINE_TYPE(unichars, unichar_t);
4899   9
5683  10 extern const uint8_t *const uni_utf8_non1_bytes;
4899  11
4899  12 /* Returns number of characters in a NUL-terminated unicode string */
4899  13 unsigned int uni_strlen(const unichar_t *str);
4899  14 /* Translates UTF-8 input to UCS-4 output. Returns 0 if ok, -1 if input was
4899  15    invalid */
7042  16 int uni_utf8_to_ucs4(const char *input, ARRAY_TYPE(unichars) *output);
4899  17 /* Translates UCS-4 input to UTF-8 output. */
4899  18 void uni_ucs4_to_utf8(const unichar_t *input, size_t len, buffer_t *output);
5683  19 void uni_ucs4_to_utf8_c(unichar_t chr, buffer_t *output);
4899  20
5683  21 /* Returns 1 if *chr_r is set, 0 for incomplete trailing character,
5683  22    -1 for invalid input. */
5683  23 int uni_utf8_get_char(const char *input, unichar_t *chr_r);
5683  24 int uni_utf8_get_char_n(const void *input, size_t max_len, unichar_t *chr_r);
4899  25 /* Returns UTF-8 string length with maximum input size. */
4899  26 unsigned int uni_utf8_strlen_n(const void *input, size_t size);
4899  27
5683  28 /* Returns the number of bytes belonging to this partial UTF-8 character.
5683  29    Invalid input is returned with length 1. */
5683  30 static inline unsigned int uni_utf8_char_bytes(char chr)
5683  31 {
5683  32 	/* 0x00 .. 0x7f are ASCII. 0x80 .. 0xC1 are invalid. */
5683  33 	if ((uint8_t)chr < (192 + 2))
5683  34 		return 1;
5683  35 	return uni_utf8_non1_bytes[(uint8_t)chr - (192 + 2)];
5683  36 }
4899  37
6129  38 /* Return given character in titlecase. */
6129  39 unichar_t uni_ucs4_to_titlecase(unichar_t chr);
6129  40
6129  41 /* Convert UTF-8 input to titlecase and decompose the titlecase characters to
6129  42    output buffer. Returns 0 if ok, -1 if input was invalid. This generates
7185  43    output that's compatible with i;unicode-casemap comparator. Invalid input
7185  44    is replaced with unicode replacement character (0xfffd). */
6129  45 int uni_utf8_to_decomposed_titlecase(const void *input, size_t max_len,
6129  46 				     buffer_t *output);
6129  47
7185  48 /* If input contains only valid UTF-8 characters, return TRUE without updating
7185  49    buf. If input contains invalid UTF-8 characters, replace them with unicode
7185  50    replacement character (0xfffd), write the output to buf and return FALSE. */
6952  51 bool uni_utf8_get_valid_data(const unsigned char *input, size_t size,
6952  52 			     buffer_t *buf);
6951  53
4899  54 #endif
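
The lead-byte lookup in uni_utf8_char_bytes() can be tried in isolation with the sketch below. This is not the library's code: the contents of uni_utf8_non1_bytes are not shown in this header, so sketch_non1_bytes is a hypothetical stand-in filled in from the original UTF-8 layout (sequences of up to 6 bytes), and sketch_utf8_count() is an illustrative helper that is not claimed to match uni_utf8_strlen_n(). It only hops from lead byte to lead byte and does not validate continuation bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for uni_utf8_non1_bytes: sequence lengths for the
   lead bytes 0xC2..0xFF. The real table's contents are an assumption here. */
static const uint8_t sketch_non1_bytes[256 - 192 - 2] = {
	/* 0xC2..0xDF: 2-byte sequences (30 entries) */
	2,2,2,2,2,2,2,2,2,2,
	2,2,2,2,2,2,2,2,2,2,
	2,2,2,2,2,2,2,2,2,2,
	/* 0xE0..0xEF: 3-byte sequences (16 entries) */
	3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
	/* 0xF0..0xF7: 4 bytes, 0xF8..0xFB: 5, 0xFC..0xFD: 6,
	   0xFE..0xFF: invalid, reported as length 1 */
	4,4,4,4,4,4,4,4, 5,5,5,5, 6,6, 1,1
};

/* Same index arithmetic as uni_utf8_char_bytes(), using the stand-in table. */
static inline unsigned int sketch_utf8_char_bytes(char chr)
{
	/* 0x00..0x7f are ASCII, 0x80..0xC1 are invalid as a lead byte:
	   both cases are treated as a single byte. */
	if ((uint8_t)chr < (192 + 2))
		return 1;
	return sketch_non1_bytes[(uint8_t)chr - (192 + 2)];
}

/* Illustrative helper (hypothetical name): counts the characters that fit
   completely within size bytes, stopping at an incomplete trailing one. */
static unsigned int sketch_utf8_count(const void *input, size_t size)
{
	const unsigned char *p = input;
	size_t pos = 0;
	unsigned int count = 0;

	assert(input != NULL || size == 0);
	while (pos < size) {
		unsigned int len = sketch_utf8_char_bytes((char)p[pos]);

		if (len > size - pos)
			break; /* incomplete trailing character */
		pos += len;
		count++;
	}
	return count;
}
```

With this table, a byte in 0x80..0xC1 or 0xFE..0xFF yields length 1, which is the "invalid input is returned with length 1" behaviour that the header comment describes.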