Internationalization

Character representation

UnixWare can represent up to four code sets concurrently in an 8-bit byte stream. The code sets are configured in a scheme called ``extended UNIX code,'' or EUC. The primary code set (code set 0) is always 7-bit US ASCII. Each byte of any character in a supplementary code set (code set 1, 2, or 3) has the high-order bit set; code sets 2 and 3 are distinguished from code set 1, and from each other, by a special ``shift byte'' that precedes each character.

EUC code set representations

Code set	EUC representation
0	`0xxxxxxx`
1	`1xxxxxxx [ 1xxxxxxx [...]]`
2	`SS2 1xxxxxxx [ 1xxxxxxx [...]]`
3	`SS3 1xxxxxxx [ 1xxxxxxx [...]]`

SS2 is represented in hexadecimal by 0x8e, SS3 by 0x8f.
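
The table can be read as a decision rule on the first byte of each character. The following is a minimal sketch of that rule, not part of the UnixWare interface: the names euc_codeset and euc_count_chars are hypothetical, and the width array (bytes per character in each code set, not counting the shift byte) is an assumption that the caller must supply for the locale in use. It follows the table literally, so any byte with the high-order bit set that is not SS2 or SS3 is treated as code set 1.

```c
#include <assert.h>
#include <stddef.h>

#define SS2 0x8e   /* shift byte that precedes a code set 2 character */
#define SS3 0x8f   /* shift byte that precedes a code set 3 character */

/* Hypothetical helper: return the EUC code set (0-3) selected by the
 * first byte of a character, following the table above. */
int euc_codeset(unsigned char lead)
{
    if ((lead & 0x80) == 0)
        return 0;            /* high-order bit clear: 7-bit US ASCII */
    if (lead == SS2)
        return 2;
    if (lead == SS3)
        return 3;
    return 1;                /* high-order bit set, no shift byte */
}

/* Hypothetical helper: count the characters in a NUL-terminated EUC
 * string. width[n] is the assumed number of bytes per character in
 * code set n, excluding the shift byte; width[0] is ignored.
 * Counting stops early at a truncated trailing character. */
size_t euc_count_chars(const unsigned char *s, const int width[4])
{
    size_t count = 0;

    assert(s != NULL);
    while (*s != '\0') {
        int cs = euc_codeset(*s);
        int len = (cs == 0) ? 1 : width[cs] + (cs >= 2 ? 1 : 0);
        int i;

        if (len < 1)
            return count;    /* unusable width table entry */
        for (i = 1; i < len; i++) {
            if (s[i] == '\0')
                return count;    /* character cut off by end of string */
        }
        s += len;
        count++;
    }
    return count;
}
```

With an assumed width table of {1, 2, 1, 2}, the five bytes 'A', 0xb0, 0xa1, 0x8e, 0xb1 are counted as three characters: one from code set 0, one two-byte character from code set 1, and one code set 2 character introduced by SS2.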

EUC is provided mainly to support the huge number of ideograms needed for I/O in an Asian-language environment. To work within the constraints of usual computer architectures, these ideograms are encoded as sequences of bytes, or ``multibyte characters.'' Because single-byte characters (the digits 0-9, say) can be intermixed with multibyte characters, the sequence of bytes needed to encode an ideogram must be self-identifying: regardless of the supplementary code set used, each byte of a multibyte character has the high-order bit set; if code set 2 or 3 is used, each multibyte character is also preceded by a shift byte. In a moment, we will take a closer look at multibyte characters and at the implementation-defined integral type wchar_t, which lets you manipulate variable-width characters as uniformly sized data objects called ``wide characters.'' We will also discuss the functions you use to manage multibyte and wide characters.
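
As a preview of the relationship between multibyte and wide characters, here is a minimal sketch that walks a multibyte string one character at a time with the standard C function mbtowc, converting each character to a wchar_t. The function name mb_char_count is hypothetical. The sketch assumes the caller has already selected a locale (for example with setlocale(LC_CTYPE, "")); without that call the program runs in the "C" locale, where every character is a single byte.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count the characters (not bytes) in the
 * multibyte string s under the current locale. Returns (size_t)-1
 * if s contains a byte sequence that is not a valid character. */
size_t mb_char_count(const char *s)
{
    size_t count = 0;
    size_t left;

    assert(s != NULL);
    left = strlen(s);
    (void)mbtowc(NULL, NULL, 0);      /* reset any conversion state */
    while (left > 0) {
        wchar_t wc;                   /* receives the wide character */
        int n = mbtowc(&wc, s, left); /* bytes consumed, or -1 */

        if (n <= 0)
            return (size_t)-1;        /* invalid or incomplete sequence */
        s += n;
        left -= (size_t)n;
        count++;
    }
    return count;
}
```

Each pass through the loop consumes a variable number of bytes but produces exactly one wchar_t, which is the point of the wide character type: code that indexes or compares characters can work on fixed-size objects.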

Of course, programmers developing applications for less complex linguistic environments need not concern themselves with the details of multibyte or wide character processing. In Europe, for instance, a single 8-bit code set can hold all the characters of the major languages. In these environments, at least one 8-bit character set will be represented in the EUC code sets, usually code sets 0 and 1. Other character sets may be represented simultaneously, in various combinations. Applications will work correctly with any standard 7- or 8-bit character set, provided (1) they are ``8-bit clean'' -- they make no assumptions about the contents of the high-order bit when processing characters; and (2) they correctly use the functions supplied by the interface for tasks that depend on the code set, namely character classification and conversion. We will take a brief look at these issues now.
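
Both conditions can be illustrated with a short sketch. The function name upcase_in_place is hypothetical; islower and toupper are the standard classification and conversion functions from <ctype.h>. The code never masks or tests the high-order bit itself, and it converts each byte to unsigned char before classification so that characters above 0x7f are not passed as negative values. Which 8-bit characters count as lowercase depends on the locale the caller has selected with setlocale.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: convert a single-byte string to uppercase in
 * place, relying on the locale's tables rather than on ASCII ranges
 * such as 'a'..'z'. Returns the number of characters changed. */
size_t upcase_in_place(char *s)
{
    size_t changed = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* Keep all 8 bits and avoid a negative argument to islower. */
        unsigned char c = (unsigned char)*s;

        if (islower(c)) {
            *s = (char)toupper(c);
            changed++;
        }
    }
    return changed;
}
```

A version that tested c >= 'a' && c <= 'z', or that cleared the high-order bit with c & 0x7f, would work for US ASCII but would mishandle or corrupt the accented letters of an 8-bit code set.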