GNU C Library (libc) Programming Guide - The message catalog files

8.1.2 Format of the message catalog files

The only reasonable way to translate all the messages of a function and store the result in a message catalog file which can be read by the catopen function is to hand all the message text to the translator and let her or him translate it. That is, we need a file whose entries associate each set/message tuple with a specific translation. This file format is specified in the X/Open standard and is as follows:

Lines containing only whitespace characters or empty lines are ignored.
Lines which contain as the first non-whitespace character a $ followed by a whitespace character are comments and are also ignored.
If a line contains as the first non-whitespace characters the sequence $set followed by a whitespace character, an additional argument is required to follow. This argument can be either:
- a number. In this case the value of this number determines the set to which the following messages are added.
- an identifier consisting of alphanumeric characters plus the underscore character. In this case the set is automatically assigned a number, which is one more than the largest set number that has appeared so far.
  How to use the symbolic names is explained in section Common Usage.
  It is an error if a symbol name appears more than once. All following messages are placed in a set with this number.
If a line contains as the first non-whitespace characters the sequence $delset followed by a whitespace character, an additional argument is required to follow. This argument can be either:
- a number. In this case the value of this number determines the set which will be deleted.
- an identifier consisting of alphanumeric characters plus the underscore character. This symbolic identifier must match the name of a previously defined set. It is an error if the name is unknown.
In both cases all messages in the specified set are removed and do not appear in the output. If the set is later selected again with a $set command, messages can be added to it again, and these messages will appear in the output.
If a line contains after leading whitespace the sequence $quote, the quoting character used for this input file is changed to the first non-whitespace character following the $quote. If no non-whitespace character is present before the line ends, quoting is disabled.
By default no quoting character is used. In this mode strings are terminated by the first unescaped line break. If a quote character has been set with $quote, newlines need not be escaped. Instead a string is terminated by the first unescaped appearance of the quote character.
A common usage of this feature would be to set the quote character to ". Then any appearance of the " in the strings must be escaped using the backslash (i.e., \" must be written).
Any other line must start with a number or an alphanumeric identifier (with the underscore character included). The following characters (starting after the first whitespace character) form the string which is associated with the currently selected set and the message number represented by the number or identifier, respectively.
If the start of the line is a number, the message number is obvious. It is an error if the same message number has already appeared for this set.
If the leading token is an identifier, the message number is assigned automatically. The value is the current maximum message number for this set plus one. It is an error if the identifier was already used for a message in this set. It is OK to reuse the identifier for a message in another set. How to use the symbolic identifiers will be explained below (see Common Usage). There is one limitation with the identifier: it must not be Set. The reason will be explained below.
The text of the messages can contain escape characters. The usual bunch of characters known from the ISO C language are recognized (\n, \t, \v, \b, \r, \f, \\, and \nnn, where nnn is the octal coding of a character code).
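
The line-level rules above can be sketched as a small classifier. This is only an illustration, not the parser gencat uses: the names line_kind, has_directive, and classify_line are hypothetical, the sketch neither validates directive arguments nor tracks the quote character, and it assumes that the end of the string may stand in for the whitespace after a directive (for lines whose newline has been stripped).

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Kinds of lines in a catalog source file (hypothetical names). */
enum line_kind {
    LINE_BLANK, LINE_COMMENT, LINE_SET, LINE_DELSET,
    LINE_QUOTE, LINE_MESSAGE, LINE_INVALID
};

/* Nonzero if S starts with WORD followed by whitespace or end of string. */
static int has_directive(const char *s, const char *word)
{
    size_t n = strlen(word);
    return strncmp(s, word, n) == 0
           && (s[n] == '\0' || isspace((unsigned char) s[n]));
}

/* Classify one line of a message catalog source file. */
enum line_kind classify_line(const char *line)
{
    assert(line != NULL);

    /* Leading whitespace never matters for classification. */
    while (*line != '\0' && isspace((unsigned char) *line))
        line++;

    if (*line == '\0')
        return LINE_BLANK;

    if (*line == '$') {
        /* "$" followed by whitespace is a comment.  Assumption: a bare
           "$" at the end of a stripped line is treated the same way. */
        if (line[1] == '\0' || isspace((unsigned char) line[1]))
            return LINE_COMMENT;
        if (has_directive(line, "$set"))
            return LINE_SET;
        if (has_directive(line, "$delset"))
            return LINE_DELSET;
        if (has_directive(line, "$quote"))
            return LINE_QUOTE;
        return LINE_INVALID;
    }

    /* Any other line must start with a number or an identifier. */
    if (isalnum((unsigned char) *line) || *line == '_')
        return LINE_MESSAGE;

    return LINE_INVALID;
}
```

For instance, classify_line("$set SetOne") yields LINE_SET and classify_line("4000 text") yields LINE_MESSAGE. A real parser would go on to read the argument or the message text.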

Important: The handling of identifiers instead of numbers for the set and messages is a GNU extension. Systems strictly following the X/Open specification do not have this feature.

An example of a message catalog file:

     $ This is a leading comment.
     $quote "
     
     $set SetOne
     1 Message with ID 1.
     two "   Message with ID \"two\", which gets the value 2 assigned"
     
     $set SetTwo
     $ Since the last set got the number 1 assigned this set has number 2.
     4000 "The numbers can be arbitrary, they need not start at one."

This small example shows various aspects:

Lines 1 and 9 are comments since they start with $ followed by whitespace.
The quoting character is set to ". Otherwise the quotes in the message definition would have to be omitted, and in that case the message with the identifier two would lose its leading whitespace.
Mixing numbered messages with messages having symbolic names is no problem, and the numbering happens automatically.

While this file format is fairly simple, it is not well suited for use in a running program. The catopen function would have to parse the file and handle syntax errors gracefully. This is not easy, and the whole process is slow. Therefore the catgets functions expect the data in another, more compact and ready-to-use file format. A special program, gencat, creates files in this format; it is explained in detail in the next section.

Files in this other format are not human-readable: the format is binary so that programs can use it easily. It is nevertheless byte-order independent, so translation files can be shared by systems of arbitrary architecture (as long as they use the GNU C Library).

Details of the binary file format are not important, since these files are always created by the gencat program. The GNU C Library sources include the sources for gencat, so the interested reader can study these source files to learn about the file format.
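
Since programs never read the binary format directly, access to a compiled catalog goes through catopen, catgets, and catclose from nl_types.h. The following is a minimal sketch under stated assumptions: fetch_message is a hypothetical helper, and the path is assumed to name a catalog already produced by gencat. The text is copied into a caller-supplied buffer so that it remains usable after the catalog is closed.

```c
#include <assert.h>
#include <nl_types.h>
#include <stdio.h>

/* Copy message MSG of set SET from the compiled catalog at PATH into
   BUF.  FALLBACK is used if the catalog or the message is missing.
   Returns 1 if the catalog could be opened, 0 otherwise.
   (Hypothetical helper, not part of the GNU C Library.) */
int fetch_message(const char *path, int set, int msg,
                  const char *fallback, char *buf, size_t buflen)
{
    assert(path != NULL && fallback != NULL && buf != NULL && buflen > 0);

    nl_catd cat = catopen(path, 0);
    int opened = (cat != (nl_catd) -1);

    /* catgets returns FALLBACK itself when the lookup fails. */
    const char *text = opened ? catgets(cat, set, msg, fallback) : fallback;
    snprintf(buf, buflen, "%s", text);

    if (opened)
        catclose(cat);
    return opened;
}
```

If the example file above were compiled into a catalog, say under the hypothetical name ./example.cat, then fetch_message("./example.cat", 1, 2, "?", buf, sizeof buf) should retrieve the message defined with the identifier two, and set 2 with message number 4000 should retrieve the last message.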