Internationalization

Character classification and conversion

The ANSI C functions declared in the <ctype.h> header file classify or convert character-coded integer values according to type and conversion information in the program's locale. All the classification functions except isdigit and isxdigit can return nonzero (true) for single-byte supplementary code set characters when the LC_CTYPE category of the current locale is other than "C". In a Spanish locale, isalpha('ñ') should be true. Similarly, the case conversion functions toupper and tolower will appropriately convert any single-byte supplementary code set characters identified by the isalpha function.
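The locale dependence described above can be sketched as follows. The helper name alpha_in_locale is hypothetical, and any locale name you pass other than "C" is an assumption about what is installed on your system; the function returns -1 when the requested locale cannot be selected.

```c
#include <ctype.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: report whether ch is alphabetic under the named
   LC_CTYPE locale. ch must be representable as an unsigned char.
   Returns 1 or 0, or -1 if the locale cannot be selected. */
int alpha_in_locale(const char *locale_name, int ch)
{
    char saved[256];
    const char *cur = setlocale(LC_CTYPE, NULL);  /* query only */
    int result;

    if (cur == NULL || strlen(cur) >= sizeof saved)
        return -1;
    strcpy(saved, cur);          /* later setlocale calls may overwrite cur */

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;               /* locale not available on this system */

    result = isalpha(ch) != 0;
    setlocale(LC_CTYPE, saved);  /* restore the caller's locale */
    return result;
}
```

In the "C" locale, alpha_in_locale("C", 0xF1) is 0. In a locale whose single-byte code set places 'ñ' at 0xF1 (ISO 8859-1, for instance), the same call should return 1, provided that locale is installed.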

The point of these functions is to let you determine a character's type or case without reference to its numeric value in a given code set. Whereas a program written for a US ASCII environment might test whether a character is nonprintable with the code

   if ( c <= 037 || c == 0177 )

a codeset-independent program will use isprint:

   if ( !isprint(c) )

Similarly,

   c = toupper(c);

will do the same thing as

   if ( c >= 'a' && c <= 'z' )
   	c -= 'a' - 'A';

without relying on the fact that uppercase and lowercase characters are each numerically contiguous in the US ASCII code set.
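As a minimal sketch of the codeset-independent style, the two functions below process a whole string with isprint and toupper. The function names count_nonprintable and upcase are illustrative only. Each character is cast to unsigned char before it is passed to a <ctype.h> function, so that the argument is never negative.

```c
#include <ctype.h>
#include <stddef.h>

/* Count the nonprintable characters in s using isprint instead of
   comparisons against numeric code values. */
size_t count_nonprintable(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (!isprint((unsigned char)*s))
            n++;
    }
    return n;
}

/* Convert s to uppercase in place using toupper instead of
   arithmetic on code values. */
void upcase(char *s)
{
    for (; *s != '\0'; s++)
        *s = (char)toupper((unsigned char)*s);
}
```

Because both functions go through <ctype.h>, their results follow the LC_CTYPE category of whatever locale the program has selected.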

The <ctype.h> functions are almost always macros that are implemented using table lookups indexed by the character argument. Their behavior is changed by resetting the table(s) to the new locale's values, so there should be no performance impact. The classification functions are described on the ctype(3C) manual page, the conversion functions on the conv(3C) page. Both single- and multibyte character classification and conversion routines are declared in the <wchar.h> header, and described on the pages wctype(3C) and wconv(3C). Note that the multibyte routines are not part of the ANSI C standard, nor are the single-byte functions isascii and toascii.

Sign extension

In some C language implementations, character variables that are not explicitly declared signed or unsigned are treated as nonnegative quantities with a range typically from 0 to 255. In other implementations, they are treated as signed quantities with a range typically from -128 to 127. When a signed object of type char is converted to a wider integer, the machine is obliged to propagate the sign, which is encoded in the high-order bit of the new integer object. If the character variable holds an eight-bit character with the high-order bit set, the sign bit will be propagated the full width of an object of type int or long, producing a negative value.

You can avoid this problem (which typically occurs with the ctype functions) by declaring as unsigned any object of type char that is liable to be converted to a wider integer. In the example we showed earlier, for instance, declaring the character pointer as a pointer to unsigned char would guarantee that on any implementation the values pointed at will be nonnegative.
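The difference can be sketched with two small functions. The names widen_plain, widen_unsigned, and is_alpha_byte are hypothetical, and the comments assume 8-bit characters.

```c
#include <ctype.h>

/* Widen a plain char directly. Where plain char is signed, a character
   with the high-order bit set becomes a negative int. */
int widen_plain(char c)
{
    return c;
}

/* Widen through unsigned char. The result is nonnegative on any
   implementation. */
int widen_unsigned(char c)
{
    return (unsigned char)c;
}

/* Safe way to hand a plain char to a ctype function. */
int is_alpha_byte(char c)
{
    return isalpha((unsigned char)c) != 0;
}
```

For a character whose code is 0xF1, widen_unsigned returns 241 everywhere, while widen_plain returns either 241 or a negative value depending on whether the implementation treats plain char as signed.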

Characters used as indices

A related problem arises when characters are used as indices into arrays and tables. If a table has been defined to contain only 128 possible characters, the amount of allocated memory will be exceeded if an eight-bit character whose value is greater than 127 is used as an index. Moreover, if the character is signed, the index may be negative.

The solution, at least when dealing with 8-bit code sets, is to increase the size of the table from the 7-bit maximum of 128 to the 8-bit maximum of 256 and, again, to declare the object that will hold the character as type unsigned char.
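A minimal sketch of such a table follows. The names TABLE_SIZE and count_bytes are illustrative. The table is sized from UCHAR_MAX rather than a literal 256, and the index is held in an unsigned char so that it can be neither negative nor out of range.

```c
#include <limits.h>
#include <stddef.h>

/* One slot for every value an unsigned char can hold
   (256 for 8-bit characters). */
#define TABLE_SIZE (UCHAR_MAX + 1)

/* Count how often each character value occurs in s.
   counts must point to TABLE_SIZE elements. */
void count_bytes(const char *s, size_t counts[])
{
    int i;

    for (i = 0; i < TABLE_SIZE; i++)
        counts[i] = 0;

    for (; *s != '\0'; s++) {
        unsigned char idx = (unsigned char)*s;  /* never negative */
        counts[idx]++;
    }
}
```

Had idx been declared as plain char on an implementation where char is signed, a character such as 0xF1 would index before the start of the table.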