6.5.4 The iconv Implementation in the GNU C library
After reading about the problems of iconv implementations in the
last section it is certainly good to note that the implementation in
the GNU C library has none of the problems mentioned above. What
follows is a step-by-step analysis of the points raised above. The
evaluation is based on the current state of the development (as of
January 1999). The development of the iconv functions is not
complete, but basic functionality has solidified.
The GNU C library's iconv implementation uses shared loadable
modules to implement the conversions. A very small number of
conversions are built into the library itself, but these are only rather
trivial conversions.
All the benefits of loadable modules are available in the GNU C library
implementation. This is especially appealing since the interface is
well documented (see below), and it, therefore, is easy to write new
conversion modules. The drawback of using loadable objects is not a
problem in the GNU C library, at least on ELF systems. Since the
library is able to load shared objects even in statically linked
binaries, static linking need not be forbidden in case one wants to use
iconv.
The second mentioned problem is the number of supported conversions.
Currently, the GNU C library supports more than 150 character sets. The
way the implementation is designed the number of supported conversions
is greater than 22350 (150 times 149). If any conversion
from or to a character set is missing, it can be added easily.
Particularly impressive as it may be, this high number is due to the
fact that the GNU C library implementation of iconv does not have
the third problem mentioned above (i.e., whenever there is a conversion
from a character set A to B and from
B to C it is always possible to convert from
A to C directly). If iconv_open
returns an error and sets errno to EINVAL, there is no
known way, directly or indirectly, to perform the wanted conversion.
Triangulation is achieved by providing for each character set a
conversion from and to UCS-4 encoded ISO 10646. Using ISO 10646
as an intermediate representation it is possible to triangulate
(i.e., convert with an intermediate representation).
There is no inherent requirement to provide a conversion to ISO 10646 for a new character set, and it is also possible to provide other
conversions where neither source nor destination character set is ISO 10646. The existing set of conversions is simply meant to cover all
conversions that might be of interest.
All currently available conversions use the triangulation method above,
which makes some conversions run unnecessarily slowly. If, for example, somebody
often needs the conversion from ISO-2022-JP to EUC-JP, a quicker solution
would be a direct conversion between the two character sets that skips
the intermediate conversion to ISO 10646. The two character sets of interest
are much more similar to each other than to ISO 10646.
In such a situation one easily can write a new conversion and provide it
as a better alternative. The GNU C library iconv implementation
would automatically use the module implementing the conversion if it is
specified to be more efficient.
6.5.4.1 Format of gconv-modules files
All information about the available conversions comes from a file named
gconv-modules, which can be found in any of the directories along
the GCONV_PATH. The gconv-modules files are line-oriented
text files, where each of the lines has one of the following formats:
If the first non-whitespace character is a # the line contains only
comments and is ignored.
Lines starting with alias define an alias name for a character
set. Two more words are expected on the line. The first word
defines the alias name, and the second defines the original name of the
character set. The effect is that it is possible to use the alias name
in the fromset or toset parameters of iconv_open and
achieve the same result as when using the real character set name.
This is quite important as a character set often has many different
names. There is normally an official name but this need not correspond to
the most popular name. Besides this, many character sets have special
names that are somehow constructed. For example, all character sets
specified by the ISO have an alias of the form ISO-IR-nnn
where nnn is the registration number. This allows programs that
know about the registration number to construct character set names and
use them in iconv_open calls. More on the available names and
aliases follows below.
Lines starting with module introduce an available conversion
module. These lines must contain three or four more words.
The first word specifies the source character set, the second word the
destination character set of conversion implemented in this module, and
the third word is the name of the loadable module. The filename is
constructed by appending the usual shared object suffix (normally
.so) and this file is then supposed to be found in the same
directory the gconv-modules file is in. The last word on the line,
which is optional, is a numeric value representing the cost of the
conversion. If this word is missing, a cost of 1 is assumed. The
numeric value itself does not matter that much; what counts are the
relative values of the sums of costs for all possible conversion paths.
Below is a more precise description of the use of the cost value.
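The line formats described above can be sketched as a small classifier. This is only an illustration of the rules in the text, not the parser used by the library: `struct parsed_line`, `enum line_kind`, and `parse_gconv_line` are hypothetical names, the 63-character word limit is arbitrary, and treating blank lines as ignorable is an assumption.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one parsed line; not a glibc type. */
enum line_kind { LINE_IGNORED, LINE_ALIAS, LINE_MODULE, LINE_INVALID };

struct parsed_line
{
  enum line_kind kind;
  char from[64];    /* alias name, or source character set */
  char to[64];      /* original name, or destination character set */
  char module[64];  /* module name without the shared object suffix */
  int cost;         /* 1 when the optional cost word is missing */
};

/* Classify one line of a gconv-modules file. */
static struct parsed_line
parse_gconv_line (const char *line)
{
  struct parsed_line r;
  char keyword[16];
  int n;

  memset (&r, 0, sizeof r);
  r.kind = LINE_IGNORED;
  r.cost = 1;

  /* Comment lines start with '#' after optional whitespace.  Blank
     lines are skipped as well (an assumption of this sketch). */
  line += strspn (line, " \t");
  if (*line == '\0' || *line == '\n' || *line == '#')
    return r;

  if (sscanf (line, "%15s", keyword) != 1)
    return r;

  if (strcmp (keyword, "alias") == 0)
    {
      /* Two more words: the alias name and the original name. */
      n = sscanf (line, "%*s %63s %63s", r.from, r.to);
      r.kind = (n == 2) ? LINE_ALIAS : LINE_INVALID;
    }
  else if (strcmp (keyword, "module") == 0)
    {
      /* Three or four more words; the cost stays 1 if it is absent. */
      n = sscanf (line, "%*s %63s %63s %63s %d",
                  r.from, r.to, r.module, &r.cost);
      r.kind = (n >= 3) ? LINE_MODULE : LINE_INVALID;
    }
  else
    r.kind = LINE_INVALID;

  return r;
}
```

Calling `parse_gconv_line ("module A// B// AB")` yields a `LINE_MODULE` record whose `cost` is 1, which mirrors the default described above.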
Return to the example above, where one has written a module to directly
convert from ISO-2022-JP to EUC-JP and back. All that has to be done is
to put the new module, let its name be ISO2022JP-EUCJP.so, in a directory
and add a file gconv-modules with the following content in the
same directory:
    module ISO-2022-JP// EUC-JP// ISO2022JP-EUCJP 1
    module EUC-JP// ISO-2022-JP// ISO2022JP-EUCJP 1
To see why this is sufficient, it is necessary to understand how the
conversion used by iconv (and described in the descriptor) is
selected. The approach to this problem is quite simple.
At the first call of the iconv_open function the program reads
all available gconv-modules files and builds up two tables: one
containing all the known aliases and another that contains the
information about the conversions and which shared object implements
them.
6.5.4.2 Finding the conversion path in iconv
The set of available conversions forms a directed graph with weighted
edges. The weights on the edges are the costs specified in the
gconv-modules files. The iconv_open function uses an
algorithm suitable for searching for the best path in such a graph and so
constructs a list of conversions that must be performed in succession
to get the transformation from the source to the destination character
set.
Explaining why the above gconv-modules file allows the
iconv implementation to select the specific ISO-2022-JP to
EUC-JP conversion module instead of the conversion coming with the
library itself is straightforward. Since the latter conversion takes two
steps (from ISO-2022-JP to ISO 10646 and then from ISO 10646 to
EUC-JP), the cost is 1+1 = 2. The above gconv-modules
file, however, specifies that the new conversion module can perform this
conversion with only the cost of 1.
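The cost comparison can be reproduced with a few lines of code. The text only says that a suitable best-path algorithm is used, so the Bellman-Ford relaxation below is an assumption chosen for brevity, and `struct conv_edge` and `cheapest_path_cost` are hypothetical names. Character sets are identified by small integer indices.

```c
#include <limits.h>
#include <stddef.h>

enum { MAX_SETS = 16 };   /* arbitrary limit for this sketch */

/* One edge of the conversion graph: a module converting from one
   character set to another at a given cost. */
struct conv_edge { int from; int to; int cost; };

/* Return the lowest total cost for getting from SRC to DST, or -1 if
   no path exists.  All edge endpoints must be in the range
   0 .. NSETS-1. */
static int
cheapest_path_cost (const struct conv_edge *edges, size_t nedges,
                    int nsets, int src, int dst)
{
  int dist[MAX_SETS];
  int i;
  size_t e;

  if (nsets <= 0 || nsets > MAX_SETS
      || src < 0 || src >= nsets || dst < 0 || dst >= nsets)
    return -1;

  for (i = 0; i < nsets; ++i)
    dist[i] = INT_MAX;
  dist[src] = 0;

  /* nsets - 1 relaxation rounds are enough for any shortest path. */
  for (i = 0; i < nsets - 1; ++i)
    for (e = 0; e < nedges; ++e)
      if (dist[edges[e].from] != INT_MAX
          && dist[edges[e].from] + edges[e].cost < dist[edges[e].to])
        dist[edges[e].to] = dist[edges[e].from] + edges[e].cost;

  return dist[dst] == INT_MAX ? -1 : dist[dst];
}
```

With only the edges ISO-2022-JP to INTERNAL and INTERNAL to EUC-JP (cost 1 each) the result is 2; adding a direct edge of cost 1 lowers it to 1, so the direct module wins.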
A mysterious item about the gconv-modules file above (and also
the file coming with the GNU C library) are the names of the character
sets specified in the module lines. Why do almost all the names
end in //? And this is not all: the names can actually be
regular expressions. At this point in time this mystery should not be
revealed, unless you have the relevant spell-casting materials: ashes
from an original DOS 6.2 boot disk burnt in effigy, a crucifix
blessed by St. Emacs, assorted herbal roots from Central America, sand
from Cebu, etc. Sorry! The part of the implementation where
this is used is not yet finished. For now please simply follow the
existing examples. It'll become clearer once it is. –drepper
A last remark about the gconv-modules is about the names not
ending with //. A character set named INTERNAL is often
mentioned. From the discussion above and the chosen name it should have
become clear that this is the name for the representation used in the
intermediate step of the triangulation. We have said that this is UCS-4
but actually that is not quite right. The UCS-4 specification also
includes the specification of the byte ordering used. Since a UCS-4 value
consists of four bytes, a stored value is affected by byte ordering. The
internal representation is not the same as UCS-4 in case the byte
ordering of the processor (or at least the running process) is not the
same as the one required for UCS-4. This is done for performance reasons
as one does not want to perform unnecessary byte-swapping operations if
one is not interested in actually seeing the result in UCS-4. To avoid
trouble with endianness, the internal representation consistently is named
INTERNAL even on big-endian systems where the representations are
identical.
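The difference between a value in host byte order and the same value stored as UCS-4 can be made visible with a short sketch. The helper names `put_ucs4` and `host_order_matches_ucs4` are hypothetical; the only fact relied upon is that UCS-4 stores the most significant byte first.

```c
#include <stdint.h>
#include <string.h>

/* Store C as UCS-4, that is, most significant byte first, regardless
   of the byte order of the host. */
static void
put_ucs4 (uint32_t c, unsigned char out[4])
{
  out[0] = (unsigned char) (c >> 24);
  out[1] = (unsigned char) (c >> 16);
  out[2] = (unsigned char) (c >> 8);
  out[3] = (unsigned char) c;
}

/* Return nonzero if the host lays out 32-bit values in the byte order
   UCS-4 requires, in which case both representations coincide. */
static int
host_order_matches_ucs4 (void)
{
  uint32_t probe = 0x000000E9u;          /* U+00E9 */
  unsigned char host[4], ucs4[4];

  memcpy (host, &probe, sizeof host);    /* bytes as laid out in memory */
  put_ucs4 (probe, ucs4);
  return memcmp (host, ucs4, sizeof host) == 0;
}
```

On a little-endian machine `host_order_matches_ucs4` returns zero, which is exactly the case in which a byte swap would be needed to turn the host-order value into real UCS-4.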
6.5.4.3 iconv module data structures
So far this section has described how modules are located and considered
to be used. What remains to be described is the interface of the modules
so that one can write new ones. This section describes the interface as
it is in use in January 1999. The interface will change a bit in the
future but, with luck, only in an upwardly compatible way.
The definitions necessary to write new modules are publicly available
in the non-standard header gconv.h. The following text,
therefore, describes the definitions from this header file. First,
however, it is necessary to get an overview.
From the perspective of the user of iconv the interface is quite
simple: the iconv_open function returns a handle that can be used
in calls to iconv, and finally the handle is freed with a call to
iconv_close. The problem is that the handle has to be able to
represent the possibly long sequences of conversion steps and also the
state of each conversion since the handle is all that is passed to the
iconv function. Therefore, the data structures are really the
elements necessary to understand the implementation.
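From the user's side the whole life cycle of a handle fits into one short function. The sketch below uses only the standard `iconv_open`, `iconv`, and `iconv_close` calls; the wrapper name `convert_buffer` and its calling convention are assumptions made for the example.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Convert IN (INLEN bytes, encoded in FROM) into OUT (at most OUTLEN
   bytes, encoded in TO).  Returns the number of bytes written, or
   (size_t) -1 on failure with errno set. */
static size_t
convert_buffer (const char *to, const char *from,
                const char *in, size_t inlen,
                char *out, size_t outlen)
{
  iconv_t cd = iconv_open (to, from);
  char *inptr = (char *) in;     /* iconv takes a non-const pointer */
  char *outptr = out;
  size_t inleft = inlen, outleft = outlen;
  size_t rc;
  int saved;

  /* With errno set to EINVAL no conversion path is known. */
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  rc = iconv (cd, &inptr, &inleft, &outptr, &outleft);
  if (rc != (size_t) -1)
    /* Flush: let a stateful target return to its initial state. */
    rc = iconv (cd, NULL, NULL, &outptr, &outleft);

  saved = errno;
  iconv_close (cd);
  errno = saved;

  return rc == (size_t) -1 ? (size_t) -1 : outlen - outleft;
}
```

A call such as `convert_buffer ("UCS-4BE", "UTF-8", "\xc3\xa9", 2, out, sizeof out)` writes the four bytes of U+00E9. Everything the conversion needs between the three calls travels in the single handle `cd`, which is why the handle must describe both the chain of steps and their state.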
We need two different kinds of data structures. The first describes the
conversion and the second describes the state etc. There are really two
type definitions like this in gconv.h.
— Data type: struct __gconv_step
This data structure describes one conversion a module can perform. For
each function in a loaded module with conversion functions there is
exactly one object of this type. This object is shared by all users of
the conversion (i.e., this object does not contain any information
corresponding to an actual conversion; it only describes the conversion
itself).
struct __gconv_loaded_object *__shlib_handle
const char *__modname
All these elements of the structure are used internally in the C library
to coordinate loading and unloading the shared object. One must not expect any
of the other elements to be available or initialized.
const char *__from_name
const char *__to_name
__from_name and __to_name contain the names of the source and
destination character sets. They can be used to identify the actual
conversion to be carried out since one module might implement conversions
for more than one character set and/or direction.
These elements contain pointers to the functions in the loadable module.
The interface will be explained below.
int __min_needed_from
int __max_needed_from
int __min_needed_to
int __max_needed_to
These values have to be supplied in the init function of the module. The
__min_needed_from value specifies how many bytes a character of
the source character set at least needs. The __max_needed_from
value specifies the maximum, which also includes possible shift sequences.
The __min_needed_to and __max_needed_to values serve the
same purpose as __min_needed_from and __max_needed_from but
this time for the destination character set.
It is crucial that these values be accurate since otherwise the
conversion functions will have problems or not work at all.
int __stateful
This element must also be initialized by the init function.
__stateful is nonzero if the source character set is stateful.
Otherwise it is zero.
void *__data
This element can be used freely by the conversion functions in the
module. __data can be used to communicate extra information
from one call to another. It need not be initialized if
not needed at all. If __data is assigned a pointer
to dynamically allocated memory (presumably in the init function), the
end function must deallocate that memory. Otherwise
the application will leak memory.
It is important to be aware that this data structure is shared by all
users of this specific conversion and therefore the __data
element must not contain data specific to one specific use of the
conversion function.
— Data type: struct __gconv_step_data
This is the data structure that contains the information specific to
each use of the conversion functions.
char *__outbuf
char *__outbufend
These elements specify the output buffer for the conversion step. The
__outbuf element points to the beginning of the buffer, and
__outbufend points to the byte following the last byte in the
buffer. The conversion function must not assume anything about the size
of the buffer but it can be safely assumed that there is room for at
least one complete character in the output buffer.
Once the conversion is finished, if the conversion is the last step, the
__outbuf element must be modified to point after the last byte
written into the buffer to signal how much output is available. If this
conversion step is not the last one, the element must not be modified.
The __outbufend element must not be modified.
int __is_last
This element is nonzero if this conversion step is the last one. This
information is necessary for the recursion. See the description of the
conversion function internals below. This element must never be
modified.
int __invocation_counter
The conversion function can use this element to see how many calls of
the conversion function already happened. Some character sets require a
certain prolog when generating output, and by comparing this value with
zero, one can find out whether it is the first call and whether,
therefore, the prolog should be emitted. This element must never be
modified.
int __internal_use
This element is another one rarely used but needed in certain
situations. It is assigned a nonzero value in case the conversion
functions are used to implement mbsrtowcs et al. (i.e., the
function is not used directly through the iconv interface).
This sometimes makes a difference as it is expected that the
iconv functions are used to translate entire texts while the
mbsrtowcs functions are normally used only to convert single
strings and might be used multiple times to convert entire texts.
But in this situation we would have a problem complying with some rules of
the character set specification. Some character sets require a prolog,
which must appear exactly once for an entire text. If a number of
mbsrtowcs calls are used to convert the text, only the first call
must add the prolog. However, because there is no communication between the
different calls of mbsrtowcs, the conversion functions have no
possibility to find this out. The situation is different for sequences
of iconv calls since the handle allows access to the needed
information.
The int __internal_use element is mostly used together with
__invocation_counter as follows:
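A minimal sketch of that check, written against a stand-in structure so that it compiles on its own. `struct prolog_step_data` and `must_emit_prolog` are hypothetical; the members `invocation_counter` and `internal_use` stand for `__invocation_counter` and `__internal_use` of `struct __gconv_step_data`.

```c
/* Stand-in for the two members of struct __gconv_step_data that
   matter here; the real definition lives in gconv.h. */
struct prolog_step_data
{
  int invocation_counter;   /* number of conversion calls so far */
  int internal_use;         /* nonzero when used for mbsrtowcs et al. */
};

/* A prolog is emitted only on the first call, and only when the
   conversion is driven through the iconv interface. */
static int
must_emit_prolog (const struct prolog_step_data *data)
{
  return !data->internal_use && data->invocation_counter == 0;
}
```

A conversion function would evaluate this condition before producing any output and write the prolog bytes first when it holds.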
mbstate_t *__statep
The __statep element points to an object of type mbstate_t
(see Keeping the state). The conversion of a stateful character
set must use the object pointed to by __statep to store
information about the conversion state. The __statep element
itself must never be modified.
mbstate_t __state
This element must never be used directly. It is only part of
this structure to have the needed space allocated.
6.5.4.4 iconv module interfaces
With the knowledge about the data structures we now can describe the
conversion function itself. To understand the interface a bit of
knowledge is necessary about the functionality in the C library that
loads the objects with the conversions.
It is often the case that one conversion is used more than once (i.e.,
there are several iconv_open calls for the same set of character
sets during one program run). The mbsrtowcs et al. functions in
the GNU C library also use the iconv functionality, which
increases the number of uses of the same functions even more.
Because of this multiple use of conversions, the modules do not get
loaded exclusively for one conversion. Instead a module once loaded can
be used by an arbitrary number of iconv or mbsrtowcs calls
at the same time. The splitting of the information between
conversion-function-specific information and conversion data makes this possible.
The last section showed the two data structures used to do this.
This is of course also reflected in the interface and semantics of the
functions that the modules must provide. There are three functions that
must have the following names:
gconv_init
The gconv_init function initializes the conversion-function-specific
data structure. This very same object is shared by all
conversions that use this conversion and, therefore, state information
about the conversion itself must not be stored here. If a module
implements more than one conversion, the gconv_init function will
be called multiple times.
gconv_end
The gconv_end function is responsible for freeing all resources
allocated by the gconv_init function. If there is nothing to do,
this function can be missing. Special care must be taken if the module
implements more than one conversion and the gconv_init function
does not allocate the same resources for all conversions.
gconv
This is the actual conversion function. It is called to convert one
block of text. It gets passed the conversion step information
initialized by gconv_init and the conversion data, specific to
this use of the conversion functions.
There are three data types defined for the three module interface
functions and these define the interface.
— Data type: int (*__gconv_init_fct) (struct __gconv_step *)
This specifies the interface of the initialization function of the
module. It is called exactly once for each conversion the module
implements.
As explained in the description of the struct __gconv_step data
structure above the initialization function has to initialize parts of
it.
__min_needed_from
__max_needed_from
__min_needed_to
__max_needed_to
These elements must be initialized to the exact numbers of the minimum
and maximum number of bytes used by one character in the source and
destination character sets, respectively. If the characters all have the
same size, the minimum and maximum values are the same.
__stateful
This element must be initialized to a nonzero value if the source
character set is stateful. Otherwise it must be zero.
If the initialization function needs to communicate some information
to the conversion function, this communication can happen using the
__data element of the __gconv_step structure. But since
this data is shared by all the conversions, it must not be modified by
the conversion function. The example below shows how this can be used.
    #define MIN_NEEDED_FROM 1
    #define MAX_NEEDED_FROM 4
    #define MIN_NEEDED_TO 4
    #define MAX_NEEDED_TO 4

    int
    gconv_init (struct __gconv_step *step)
    {
      /* Determine which direction. */
      struct iso2022jp_data *new_data;
      enum direction dir = illegal_dir;
      enum variant var = illegal_var;
      int result;

      if (__strcasecmp (step->__from_name, "ISO-2022-JP//") == 0)
        {
          dir = from_iso2022jp;
          var = iso2022jp;
        }
      else if (__strcasecmp (step->__to_name, "ISO-2022-JP//") == 0)
        {
          dir = to_iso2022jp;
          var = iso2022jp;
        }
      else if (__strcasecmp (step->__from_name, "ISO-2022-JP-2//") == 0)
        {
          dir = from_iso2022jp;
          var = iso2022jp2;
        }
      else if (__strcasecmp (step->__to_name, "ISO-2022-JP-2//") == 0)
        {
          dir = to_iso2022jp;
          var = iso2022jp2;
        }

      /* Unknown names mean the module cannot handle the request. */
      result = __GCONV_NOCONV;
      if (dir != illegal_dir)
        {
          new_data = (struct iso2022jp_data *)
            malloc (sizeof (struct iso2022jp_data));
          result = __GCONV_NOMEM;
          if (new_data != NULL)
            {
              new_data->dir = dir;
              new_data->var = var;
              step->__data = new_data;
              if (dir == from_iso2022jp)
                {
                  step->__min_needed_from = MIN_NEEDED_FROM;
                  step->__max_needed_from = MAX_NEEDED_FROM;
                  step->__min_needed_to = MIN_NEEDED_TO;
                  step->__max_needed_to = MAX_NEEDED_TO;
                }
              else
                {
                  step->__min_needed_from = MIN_NEEDED_TO;
                  step->__max_needed_from = MAX_NEEDED_TO;
                  step->__min_needed_to = MIN_NEEDED_FROM;
                  step->__max_needed_to = MAX_NEEDED_FROM + 2;
                }
              /* Yes, this is a stateful encoding. */
              step->__stateful = 1;
              result = __GCONV_OK;
            }
        }

      return result;
    }
The function first checks which conversion is wanted. The module from
which this function is taken implements four different conversions;
which one is selected can be determined by comparing the names. The
comparison should always be done without paying attention to the case.
Next, a data structure, which contains the necessary information about
which conversion is selected, is allocated. The data structure
struct iso2022jp_data is locally defined since, outside the
module, this data is not used at all. Please note that if all four
conversions this module supports are requested there are four data
blocks.
One interesting thing is the initialization of the __min_ and
__max_ elements of the step data object. A single ISO-2022-JP
character can consist of one to four bytes. Therefore the
MIN_NEEDED_FROM and MAX_NEEDED_FROM macros are defined
this way. The output is always the INTERNAL character set (aka
UCS-4) and therefore each character consists of exactly four bytes. For
the conversion from INTERNAL to ISO-2022-JP we have to take into
account that escape sequences might be necessary to switch the character
sets. Therefore the __max_needed_to element for this direction
gets assigned MAX_NEEDED_FROM + 2. This takes into account the
two bytes needed for the escape sequences to signal the switching. The
asymmetry in the maximum values for the two directions can be explained
easily: when reading ISO-2022-JP text, escape sequences can be handled
alone (i.e., it is not necessary to process a real character since the
effect of the escape sequence can be recorded in the state information).
The situation is different for the other direction. Since it is in
general not known which character comes next, one cannot emit escape
sequences to change the state in advance. This means the escape
sequences have to be emitted together with the next character.
Therefore one needs more room than only for the character itself.
The possible return values of the initialization function are:
__GCONV_OK
The initialization succeeded.
__GCONV_NOCONV
The requested conversion is not supported in the module. This can
happen if the gconv-modules file has errors.
__GCONV_NOMEM
Memory required to store additional information could not be allocated.
The function called before the module is unloaded is significantly
easier. It often has nothing at all to do, in which case it can be left
out completely.
— Data type: void (*__gconv_end_fct) (struct __gconv_step *)
The task of this function is to free all resources allocated in the
initialization function. Therefore only the __data element of
the object pointed to by the argument is of interest. Continuing the
example from the initialization function, the finalization function
looks like this:
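A minimal sketch of such a finalization function, paired with a matching allocation so that it can be compiled on its own. `struct owned_step` stands in for `struct __gconv_step`, its `data` member for `__data`, and both function names are hypothetical. Resetting the pointer after `free` is an addition of this sketch and not something the text requires.

```c
#include <stdlib.h>

/* Stand-in for struct __gconv_step; only the member that the
   finalization function cares about is modeled. */
struct owned_step
{
  void *data;   /* corresponds to __data */
};

/* Counterpart of the initialization function: allocate per-conversion
   information and hang it off the step object. */
static int
owned_step_init (struct owned_step *step, size_t size)
{
  step->data = malloc (size);
  return step->data != NULL ? 0 : -1;
}

/* Counterpart of gconv_end: release whatever init allocated. */
static void
owned_step_end (struct owned_step *step)
{
  free (step->data);
  step->data = NULL;   /* makes an accidental second call harmless */
}
```

Since the step object is shared by all users of the conversion, the memory released here is the shared, read-only description set up at initialization time and never per-call state.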
The most important function is the conversion function itself, which can
get quite complicated for complex character sets. But since this is not
of interest here, we will only describe a possible skeleton for the
conversion function.
— Data type: int (*__gconv_fct) (struct __gconv_step *, struct __gconv_step_data *, const char **, const char *, size_t *, int)
The conversion function can be called for two basic reasons: to convert
text or to reset the state. From the description of the iconv
function it can be seen why the flushing mode is necessary. What mode
is selected is determined by the sixth argument, an integer. This
argument being nonzero means that flushing is selected.
Common to both modes is where the output buffer can be found. The
information about this buffer is stored in the conversion step data. A
pointer to this information is passed as the second argument to this
function. The description of the struct __gconv_step_data
structure has more information on the conversion step data.
What has to be done for flushing depends on the source character set.
If the source character set is not stateful, nothing has to be done.
Otherwise the function has to emit a byte sequence to bring the state
object into the initial state. Once all this has happened, the other
conversion modules in the chain of conversions have to get the same
chance. Whether another step follows can be determined from the
__is_last element of the step data structure to which the second
argument points.
The more interesting mode is when actual text has to be converted. The
first step in this case is to convert as much text as possible from the
input buffer and store the result in the output buffer. The start of the
input buffer is determined by the third argument, which is a pointer to a
pointer variable referencing the beginning of the buffer. The fourth
argument is a pointer to the byte right after the last byte in the buffer.
The conversion has to be performed according to the current state if the
character set is stateful. The state is stored in an object pointed to
by the __statep element of the step data (second argument). Once
either the input buffer is empty or the output buffer is full the
conversion stops. At this point, the pointer variable referenced by the
third parameter must point to the byte following the last processed
byte (i.e., if all of the input is consumed, this pointer and the fourth
parameter have the same value).
What now happens depends on whether this step is the last one. If it is
the last step, the only thing that has to be done is to update the
__outbuf element of the step data structure to point after the
last written byte. This update gives the caller the information on how
much text is available in the output buffer. In addition, the variable
pointed to by the fifth parameter, which is of type size_t, must
be incremented by the number of characters (not bytes) that were
converted in a non-reversible way. Then, the function can return.
In case the step is not the last one, the later conversion functions have
to get a chance to do their work. Therefore, the appropriate conversion
function has to be called. The information about the functions is
stored in the conversion data structures, passed as the first parameter.
This information and the step data are stored in arrays, so the next
element in both cases can be found by simple pointer arithmetic:
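A compilable sketch of that arrangement follows. `struct chain_step`, `struct chain_data`, and `run_chain_step` are hypothetical stand-ins for `struct __gconv_step`, `struct __gconv_step_data`, and a conversion function; the real function takes more arguments, which are omitted here to keep the pointer arithmetic in focus.

```c
struct chain_step;
struct chain_data;

typedef int (*chain_fct) (struct chain_step *, struct chain_data *);

/* Stand-in for struct __gconv_step: holds the function pointer. */
struct chain_step
{
  chain_fct fct;
};

/* Stand-in for struct __gconv_step_data. */
struct chain_data
{
  int is_last;   /* nonzero for the final step of the chain */
  int calls;     /* bookkeeping for this example only */
};

/* Do this step's work, then hand over to the next element of both
   arrays unless this is the last step. */
static int
run_chain_step (struct chain_step *step, struct chain_data *data)
{
  struct chain_step *next_step;
  struct chain_data *next_data;

  ++data->calls;
  if (data->is_last)
    return 0;

  next_step = step + 1;   /* next entry in the array of steps */
  next_data = data + 1;   /* matching entry in the array of step data */
  return next_step->fct (next_step, next_data);
}
```

Starting the chain with `steps[0].fct (&steps[0], &data[0])` on two-element arrays visits both steps in order, provided `data[1].is_last` is nonzero.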
But this is not yet all. Once the function call returns the conversion
function might have some more to do. If the return value of the function
is __GCONV_EMPTY_INPUT, more room is available in the output
buffer. Unless the input buffer is empty, the conversion functions start
all over again and process the rest of the input buffer. If the return
value is not __GCONV_EMPTY_INPUT, something went wrong and we have
to recover from this.
A requirement for the conversion function is that the input buffer
pointer (the third argument) always point to the last character that
was put in converted form into the output buffer. This is trivially
true after the conversion performed in the current step, but if the
conversion functions deeper downstream stop prematurely, not all
characters from the output buffer are consumed and, therefore, the input
buffer pointers must be backed off to the right position.
Correcting the input buffers is easy to do if the input and output
character sets have a fixed width for all characters. In this situation
we can compute how many characters are left in the output buffer and,
therefore, can correct the input buffer pointer appropriately with a
similar computation. Things get tricky if either character set
has characters represented with variable length byte sequences, and it
gets even more complicated if the conversion has to take care of the
state. In these cases the conversion has to be performed once again, from
the known state before the initial conversion (i.e., if necessary the
state of the conversion has to be reset and the conversion loop has to be
executed again). The difference now is that it is known how much output
must be created, and the conversion can stop before converting the first
unused character. Once this is done the input buffer pointers must be
updated again and the function can return.
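For the easy case, where both character sets use a fixed width, the correction is plain arithmetic. The sketch below assumes exactly that case; the function names and parameter order are hypothetical, `outbuf` denotes the end of the output this step produced, and `outerr` the position at which the next step stopped consuming it.

```c
#include <stddef.h>

/* Given that the next step left UNCONSUMED bytes of this step's
   output untouched, return how many input bytes must be handed back.
   Only valid when every character occupies exactly IN_WIDTH bytes in
   the source and OUT_WIDTH (nonzero) bytes in the destination. */
static size_t
input_bytes_to_back_off (size_t unconsumed, size_t in_width,
                         size_t out_width)
{
  size_t chars_left = unconsumed / out_width;
  return chars_left * in_width;
}

/* Move the input pointer back so that it corresponds to the last
   character the next step actually accepted. */
static const char *
corrected_input_pointer (const char *inptr, const char *outbuf,
                         const char *outerr, size_t in_width,
                         size_t out_width)
{
  size_t unconsumed = (size_t) (outbuf - outerr);
  return inptr - input_bytes_to_back_off (unconsumed, in_width, out_width);
}
```

For a two-byte source and the four-byte INTERNAL representation, 12 unconsumed output bytes correspond to three characters, so the input pointer moves back by six bytes. With variable-width or stateful character sets no such formula exists, which is why the conversion has to be redone as described above.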
One final thing should be mentioned. If it is necessary for the
conversion to know whether it is the first invocation (in case a prolog
has to be emitted), the conversion function should increment the
__invocation_counter element of the step data structure just
before returning to the caller. See the description of the struct
__gconv_step_data structure above for more information on how this can
be used.
The return value must be one of the following values:
__GCONV_EMPTY_INPUT
All input was consumed and there is room left in the output buffer.
__GCONV_FULL_OUTPUT
No more room in the output buffer. In case this is not the last step
this value is propagated down from the call of the next conversion
function in the chain.
__GCONV_INCOMPLETE_INPUT
The input buffer is not entirely empty since it contains an incomplete
character sequence.
The following example provides a framework for a conversion function.
In case a new conversion has to be written the holes in this
implementation have to be filled and that is it.
    int
    gconv (struct __gconv_step *step, struct __gconv_step_data *data,
           const char **inbuf, const char *inbufend, size_t *written,
           int do_flush)
    {
      struct __gconv_step *next_step = step + 1;
      struct __gconv_step_data *next_data = data + 1;
      gconv_fct fct = next_step->__fct;
      int status;

      /* If the function is called with no input this means we have
         to reset to the initial state. The possibly partly
         converted input is dropped. */
      if (do_flush)
        {
          status = __GCONV_OK;
          /* Possibly emit a byte sequence which puts the state object
             into the initial state. */
          /* Call the steps down the chain if there are any but only
             if we successfully emitted the escape sequence. */
          if (status == __GCONV_OK && ! data->__is_last)
            status = fct (next_step, next_data, NULL, NULL,
                          written, 1);
        }
      else
        {
          /* We preserve the initial values of the pointer variables. */
          const char *inptr = *inbuf;
          char *outbuf = data->__outbuf;
          char *outend = data->__outbufend;
          char *outptr;

          do
            {
              /* Remember the start value for this round. */
              inptr = *inbuf;
              /* The outbuf buffer is empty. */
              outptr = outbuf;
              /* For stateful encodings the state must be safe here. */
              /* Run the conversion loop. status is set
                 appropriately afterwards. */
              /* If this is the last step, leave the loop. There is
                 nothing we can do. */
              if (data->__is_last)
                {
                  /* Store information about how many bytes are
                     available. */
                  data->__outbuf = outbuf;
                  /* If any non-reversible conversions were performed,
                     add the number to *written. */
                  break;
                }
              /* Write out all output that was produced. */
              if (outbuf > outptr)
                {
                  const char *outerr = data->__outbuf;
                  int result;

                  result = fct (next_step, next_data, &outerr,
                                outbuf, written, 0);
                  if (result != __GCONV_EMPTY_INPUT)
                    {
                      if (outerr != outbuf)
                        {
                          /* Reset the input buffer pointer. We
                             document here the complex case. */
                          /* Reload the pointers. */
                          *inbuf = inptr;
                          outbuf = outptr;
                          /* Possibly reset the state. */
                          /* Redo the conversion, but this time
                             the end of the output buffer is at
                             outerr. */
                        }
                      /* Change the status. */
                      status = result;
                    }
                  else if (status == __GCONV_FULL_OUTPUT)
                    /* All the output is consumed, we can make another
                       run if everything was ok. */
                    status = __GCONV_OK;
                }
            }
          while (status == __GCONV_OK);

          /* We finished one use of this step. */
          ++data->__invocation_counter;
        }

      return status;
    }
This information should be sufficient to write new modules. Anybody
doing so should also take a look at the available source code in the GNU
C library sources. It contains many examples of working and optimized
modules.
Published under the terms of the GNU General Public License