Internationalization

``8-bit clean''

UnixWare system applications written for 7-bit US ASCII environments have sometimes assumed that the high-order bit is available for purposes other than character processing. In data communications, for instance, it was often used as a parity bit. On receipt and after a parity check, the high-order bit was stripped either by the line discipline or the program to obtain the original 7-bit character:

   char c;
   /* bitwise AND with octal value 177 strips high-order bit */
   c &= 0177;	

Other programs used the high-order bit as a private data storage area, usually to hold a flag that was tested later:

   char c;
   /*...*/
   c |= 0200;	/* bitwise OR with octal value 200 sets flag */
   /*...*/
   c &= 0177;	/* bitwise AND removes flag */
   /*...*/
   if (c & 0200)	/* test if flag set */
   {
   /*...*/
   }
   c &= 0177;	/* original character */
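
A short sketch shows what happens when either idiom meets a character from an 8-bit code set. The helper names `strip_high_bit` and `flag_is_set` are hypothetical and do not come from any UnixWare program. The example assumes ISO 8859-1, in which the byte 0xE9 (octal 351) encodes the letter é.

```c
#define FLAG 0200   /* high-order bit, as in the fragments above */

/* Hypothetical helper: strip the high-order bit, as a 7-bit program would. */
unsigned char strip_high_bit(unsigned char c)
{
    return c & 0177;
}

/* Hypothetical helper: report whether the "private flag" appears to be set. */
int flag_is_set(unsigned char c)
{
    return (c & FLAG) != 0;
}
```

Under these assumptions, `strip_high_bit(0xE9)` yields 0x69, the letter i, so the original é is lost. Likewise, `flag_is_set(0xE9)` is nonzero even though the program never set a flag on that character.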

Neither of these practices will work with 8-bit or larger code sets, because the high-order bit is then part of the character itself. To show how to store data in a way that is independent of the code set, we will look at code fragments from a UnixWare system program before and after it was made 8-bit clean. In the first fragment, the program sets the high-order bit of characters quoted on the command line:

#define LITERAL '\''
#define QUOTE 0200
register int c;
register char *argp = arg->argval;

if (c == LITERAL)	/* character is a single quote */
{
   /* get next character until next single quote */
   while ((c = getc()) && c != LITERAL)
   {
      *argp++ = (c | QUOTE);
   }
}

In the next fragment, the same data are stored by internally placing backslashes before quoted characters in the command string:

#define LITERAL '\''
register int c;
register unsigned char *argp = arg->argval;

if (c == LITERAL)
{
   while ((c = getc()) && c != LITERAL)
   {
      /* precede each character within single quotes with a backslash */
      *argp++ = '\\';
      *argp++ = c;
   }
}

Because the quoting information is stored as separate backslash characters rather than in the high-order bit of the quoted characters, all eight bits of each character are preserved and the program works correctly with code sets other than US ASCII. Note also the type unsigned char in the declaration of the character pointer in the second fragment; the next section discusses why it is used.
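
The backslash technique can also be sketched as a self-contained function that works on a string instead of reading characters one at a time. This is only an illustration under stated assumptions, not the code of the original program: the function name `quote_with_backslashes` and its buffer-size contract are hypothetical.

```c
#include <stddef.h>

#define LITERAL '\''

/*
 * Hypothetical helper: copy src to dst, placing a backslash before every
 * byte that appears between single quotes. The quotes themselves are not
 * copied. dst must have room for 2 * strlen(src) + 1 bytes.
 * Returns the number of bytes written, not counting the terminating NUL.
 */
size_t quote_with_backslashes(const unsigned char *src, unsigned char *dst)
{
    unsigned char *out = dst;
    int in_quotes = 0;

    for (; *src != '\0'; src++) {
        if (*src == LITERAL) {
            in_quotes = !in_quotes;   /* entering or leaving a quoted run */
            continue;
        }
        if (in_quotes)
            *out++ = '\\';            /* mark the next byte as quoted */
        *out++ = *src;                /* all eight bits are kept intact */
    }
    *out = '\0';
    return (size_t)(out - dst);
}
```

Given the seven input bytes a, ', octal 351, b, ', c and a terminating NUL, the function writes a, backslash, octal 351, backslash, b, c. The byte with the high-order bit set passes through unchanged, which is the property the bit-setting version cannot provide.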



© 2004 The SCO Group, Inc. All rights reserved.
UnixWare 7 Release 7.1.4 - 27 April 2004