The Art of Unix Programming


Unix Programming - Internationalization

An in-depth discussion of code internationalization — designing software so the interface readily incorporates multiple languages and the vagaries of different character sets — would be out of scope for this book. However, a few lessons for good practice do stand out from Unix experience.

First, separate the message base from the code. Good Unix practice is to keep the message strings a program uses apart from its code, so that message dictionaries in other languages can be plugged in without modifying the code.

The best-known tool for this job is GNU gettext, which requires that you wrap native-language strings that need to be internationalized in a special macro. The macro uses each string as a key into per-language dictionaries which can be supplied as separate files. If no such dictionaries are available (or if they are but the string lookup does not return a match), the macro simply returns its argument, implicitly falling back on the native language in the code.
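The fallback behavior described above can be sketched with Python's standard `gettext` module, which follows the same model as GNU gettext. This is only an illustration: the domain name `myapp` and the message text are hypothetical, and the locale directory is deliberately an empty temporary directory so that no dictionary can be found.

```python
import gettext
import tempfile

# "myapp" is a hypothetical message domain. The locale directory is an
# empty temporary directory, so no German catalog can be found there.
with tempfile.TemporaryDirectory() as empty_localedir:
    translation = gettext.translation(
        "myapp",
        localedir=empty_localedir,
        languages=["de"],
        fallback=True,  # return a pass-through object instead of raising
    )

# By convention the lookup function is bound to the name "_".
_ = translation.gettext

# With no catalog available, the lookup returns its argument unchanged.
message = _("File not found")
print(message)
```

When a compiled catalog for the requested language is installed under the locale directory, the same call would return the translated string instead, with no change to the calling code.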

While gettext itself is messy and fragile as of mid-2003, its general philosophy is sound. For many projects, it is possible to craft a lighter-weight version of this idea with good results.
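One possible shape for such a lighter-weight version is a plain dictionary lookup keyed on the native-language string. The sketch below is an assumption about what that could look like, not a description of any particular project; the names `CATALOGS` and `make_translator` and the sample messages are invented for illustration. In practice the dictionaries would be loaded from separate per-language files.

```python
# Hypothetical per-language message dictionaries, keyed on the
# native-language string that appears in the code.
CATALOGS = {
    "de": {
        "File not found": "Datei nicht gefunden",
    },
}


def make_translator(lang):
    """Return a lookup function for lang that falls back to its argument."""
    catalog = CATALOGS.get(lang, {})  # unknown language: empty dictionary

    def translate(message):
        # A missing key falls back to the native-language string itself.
        return catalog.get(message, message)

    return translate


_ = make_translator("de")
print(_("File not found"))  # found in the German dictionary
print(_("Disk full"))       # no entry, so the original string is returned
```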

Second, there is a clear trend in modern Unixes to scrap all the historical cruft associated with multiple character sets and make applications natively speak UTF-8, the variable-length 8-bit encoding of the Unicode character set (as opposed to, say, making them natively speak 16-bit wide characters). The low 128 code points of Unicode are ASCII and are encoded in UTF-8 as the same single bytes, so existing ASCII text is already valid UTF-8. The low 256 code points are Latin-1, which keeps character numbering compatible with the two most widely used character sets, although Latin-1 characters above 127 occupy two bytes in UTF-8. The fact that XML and Java have made this choice helps, but the momentum is present even where XML and Java are not.
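To make the compatibility claim concrete, the following sketch compares how one ASCII character and one Latin-1 character are encoded. ASCII characters come out as identical bytes in UTF-8, while a Latin-1 character keeps its code point number but is written as two bytes. The sample characters are arbitrary choices for illustration.

```python
ascii_char = "A"
latin1_char = "\u00e9"  # e with acute accent, code point 233 in Latin-1

ascii_utf8 = ascii_char.encode("utf-8")
latin1_utf8 = latin1_char.encode("utf-8")
latin1_bytes = latin1_char.encode("latin-1")

# ASCII text is byte-for-byte identical in ASCII and in UTF-8.
print(ascii_utf8 == ascii_char.encode("ascii"))

# The Unicode code point equals the Latin-1 byte value ...
print(ord(latin1_char) == latin1_bytes[0])

# ... but UTF-8 spends two bytes on it.
print(len(latin1_bytes), len(latin1_utf8))
```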

Third, beware of character ranges in regular expressions. The element [a-z] will not necessarily catch all lower-case letters if the script or program it's in is applied to (say) German, where the sharp-s (ß) character is considered lower-case but does not fall in that range; similar problems arise with French accented letters. It's safer to use [[:lower:]] and other symbolic ranges described in the POSIX standard.
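The pitfall is easy to reproduce. Note one substitution in the sketch below: Python's standard `re` module does not support POSIX bracket classes such as [[:lower:]], so `str.islower()` stands in here as the locale-independent "is this a lower-case letter" test. The helper names and the sample word are invented for illustration.

```python
import re


def matches_ascii_range(word):
    """True if every character of word falls in the literal range a-z."""
    return re.fullmatch(r"[a-z]+", word) is not None


def matches_lowercase_property(word):
    """True if every character of word is a lower-case letter per Unicode."""
    return len(word) > 0 and all(ch.islower() for ch in word)


german_word = "stra\u00dfe"  # "strasse" spelled with the sharp-s character

print(matches_ascii_range(german_word))         # sharp-s is outside a-z
print(matches_lowercase_property(german_word))  # sharp-s is lower-case
```

In tools that do implement POSIX bracket expressions (grep, sed, awk and similar), the symbolic class plays the role that the character property test plays in this sketch.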


Published under free license.