This section is a quick summary of string concepts for beginning C
programmers. It describes how character strings are represented in C
and some common pitfalls. If you are already familiar with this
material, you can skip this section.
A string is an array of char objects. But string-valued
variables are usually declared to be pointers of type char *.
Such variables do not include space for the text of a string; that has
to be stored somewhere else—in an array variable, a string constant,
or dynamically allocated memory (see Memory Allocation). It's up to
you to store the address of the chosen memory space into the pointer
variable. Alternatively, you can store a null pointer in the
pointer variable. A null pointer does not point anywhere, so
attempting to access the string through it is an error.
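
As a minimal sketch of the points above, the following shows a pointer variable being given text stored in an array, in a string constant, and in dynamically allocated memory. The names copy_to_heap, safe_length and storage_demo are hypothetical helpers written for this illustration; they are not library functions.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a heap-allocated copy of TEXT, or a
   null pointer if TEXT is null or the allocation fails.  The caller
   is responsible for freeing the result.  */
char *
copy_to_heap (const char *text)
{
  if (text == NULL)
    return NULL;

  size_t size = strlen (text) + 1;   /* +1 for the terminating null */
  char *p = malloc (size);
  if (p != NULL)
    memcpy (p, text, size);
  return p;
}

/* Hypothetical helper: length of the string P points to, treating a
   null pointer as "no string" instead of dereferencing it.  */
size_t
safe_length (const char *p)
{
  return p != NULL ? strlen (p) : 0;
}

/* Returns 1 if the same text is reachable through all three kinds
   of storage, 0 otherwise (for example if allocation failed).  */
int
storage_demo (void)
{
  char buffer[16] = "hello";            /* array variable */
  char *in_array = buffer;
  const char *in_constant = "hello";    /* string constant */
  char *on_heap = copy_to_heap (in_constant);   /* allocated memory */

  int ok = on_heap != NULL
           && strcmp (in_array, in_constant) == 0
           && strcmp (on_heap, in_constant) == 0;
  free (on_heap);
  return ok;
}
```

In each case the pointer variable holds only an address; the text itself lives in the array, the constant, or the allocated block.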
The term “string” normally refers to multibyte character strings, as
opposed to wide character strings. Wide character strings are arrays
of type wchar_t and, as with multibyte character strings, they are
usually accessed through pointers, here of type wchar_t *.
By convention, a null character, '\0', marks the end of a
multibyte character string and the null wide character,
L'\0', marks the end of a wide character string. For example, in
testing to see whether the char * variable p points to a
null character marking the end of a string, you can write
!*p or *p == '\0'.
A null character is quite different conceptually from a null pointer,
although both are represented by the integer constant 0.
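
The difference between the two can be sketched as follows. The function names count_chars and is_empty_string are hypothetical, chosen only for this illustration; the first behaves like strlen and is shown only to make the end-of-string test visible.

```c
#include <stddef.h>

/* Hypothetical helper: count the characters before the terminating
   null character.  P must not be a null pointer.  */
size_t
count_chars (const char *p)
{
  size_t n = 0;
  while (*p != '\0')        /* equivalently: while (*p) */
    {
      ++p;
      ++n;
    }
  return n;
}

/* Returns 1 if P points to an empty string: the pointer itself is
   not null, but the first character it points to is the null
   character.  A null pointer is not a string at all, so it yields 0.  */
int
is_empty_string (const char *p)
{
  return p != NULL && *p == '\0';
}
```

Note that the pointer is compared against NULL before it is dereferenced; testing *p on a null pointer would be an error.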
String literals appear in C program source as strings of
characters between double-quote characters (`"'). Wide character
string literals are written the same way, except that the initial
double-quote character is immediately preceded by a capital `L'
(ell) character (as in L"foo"). In ISO C, string literals
can also be formed by string concatenation: "a" "b" is the
same as "ab". For wide character strings one can use either
L"a" L"b" or L"a" "b". Modification of string literals is
not allowed by the GNU C compiler, because literals are placed in
read-only storage.
Character arrays that are declared const cannot be modified
either. It's generally good style to declare non-modifiable string
pointers to be of type const char *, since this often allows the
C compiler to detect accidental modifications as well as providing some
amount of documentation about what your program intends to do with the
string.
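
The following sketch illustrates literal concatenation and the use of const. The identifiers joined, wide_joined, count_char and modifiable_copy_demo are hypothetical names introduced for this example. Mixing L"a" with "b" assumes a compiler that follows ISO C99 or later.

```c
#include <string.h>
#include <wchar.h>

/* Adjacent literals are concatenated at compile time.  */
static const char joined[] = "a" "b";            /* same as "ab" */
static const wchar_t wide_joined[] = L"a" "b";   /* same as L"ab" */

/* Hypothetical helper: P is only read, so it is declared as
   const char *.  An assignment such as p[0] = 'x' inside this
   function would be rejected by the compiler.  */
size_t
count_char (const char *p, char c)
{
  size_t n = 0;
  for (; *p != '\0'; ++p)
    if (*p == c)
      ++n;
  return n;
}

/* Initializing an array from a literal copies the text, so the
   array (unlike the literal itself) may be modified.  Returns 1 if
   the modification had the expected effect.  */
int
modifiable_copy_demo (void)
{
  char text[] = "foo";
  text[0] = 'g';
  return strcmp (text, "goo") == 0;
}
```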
The amount of memory allocated for the character array may extend past
the null character that normally marks the end of the string. In this
document, the term allocated size is always used to refer to the
total amount of memory allocated for the string, while the term
length refers to the number of characters up to (but not
including) the terminating null character.
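
To make the distinction concrete, here is a small sketch. The helper room_left is hypothetical and assumes that the array already holds a properly terminated string shorter than its allocated size.

```c
#include <string.h>

/* Hypothetical helper: number of additional characters that fit in
   an array of ALLOCATED_SIZE bytes which currently holds the string
   S, leaving room for the terminating null character.
   Assumes strlen (S) < ALLOCATED_SIZE.  */
size_t
room_left (const char *s, size_t allocated_size)
{
  size_t length = strlen (s);          /* excludes the null character */
  return allocated_size - length - 1;  /* reserve one byte for it */
}
```

For an array declared as char buf[32] = "hello", the allocated size (sizeof buf) is 32, the length (strlen (buf)) is 5, and room_left (buf, sizeof buf) is 26.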
A notorious source of program bugs is trying to put more characters in a
string than fit in its allocated size. When writing code that extends
strings or moves characters into a pre-allocated array, you should be
very careful to keep track of the length of the text and make explicit
checks for overflowing the array. Many of the library functions
do not do this for you! Remember also that you need to allocate
an extra byte to hold the null character that marks the end of the
string.
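
One way to make such a check explicit is sketched below. The function checked_append is a hypothetical helper, not a library function; it refuses to write anything when the result, including the terminating null character, would not fit.

```c
#include <string.h>

/* Hypothetical helper: append SRC to the string held in DST, an
   array of DST_SIZE bytes.  Returns 0 on success.  Returns -1 and
   leaves DST unchanged if the combined text plus the terminating
   null character does not fit.  */
int
checked_append (char *dst, size_t dst_size, const char *src)
{
  size_t dst_len = strlen (dst);
  size_t src_len = strlen (src);

  /* We need dst_len + src_len + 1 <= dst_size.  The test is written
     with a subtraction so that the arithmetic cannot wrap around.  */
  if (dst_len >= dst_size || src_len >= dst_size - dst_len)
    return -1;

  memcpy (dst + dst_len, src, src_len + 1);   /* +1 copies the null */
  return 0;
}
```

With char buf[8] = "abc", appending "defg" succeeds and fills the array exactly (seven characters plus the null character); appending one more character then fails.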
Originally strings were sequences of bytes where each byte represents a
single character. This is still true today if the strings are encoded
using a single-byte character encoding. Things are different if the
strings are encoded using a multibyte encoding (for more information on
encodings see Extended Char Intro). There is no difference in
the programming interface for these two kinds of strings; the programmer
has to be aware of the encoding and interpret the byte sequences
accordingly.
But since there is no separate interface taking care of these
differences, the byte-based string functions are sometimes hard to use.
Since the count parameters of these functions specify bytes, a call to
strncpy could cut a multibyte character in the middle and put an
incomplete (and therefore unusable) byte sequence in the target buffer.
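
The following sketch shows how this can happen. It assumes, purely for illustration, that the text is encoded in UTF-8, where the character é occupies the two bytes 0xC3 0xA9. The names cafe_utf8, copy_bytes and ends_with_lead_byte are hypothetical, and the lead-byte test is a deliberately simplified check, not a full UTF-8 validator.

```c
#include <string.h>

/* Assumed encoding: UTF-8.  "café" is then five bytes long.  */
static const char cafe_utf8[] = "caf\xC3\xA9";

/* Hypothetical helper: copy at most COUNT bytes of SRC into DST,
   which must have room for at least COUNT + 1 bytes, and always
   terminate the result.  */
void
copy_bytes (char *dst, const char *src, size_t count)
{
  strncpy (dst, src, count);
  dst[count] = '\0';   /* strncpy adds no null if SRC is too long */
}

/* Simplified check: returns 1 if the last byte of S is a UTF-8 lead
   byte (0xC0 or above), meaning a multibyte character was started
   but its remaining bytes are missing.  */
int
ends_with_lead_byte (const char *s)
{
  size_t len = strlen (s);
  return len > 0 && (unsigned char) s[len - 1] >= 0xC0;
}
```

Copying four bytes of cafe_utf8 keeps the 0xC3 byte but drops 0xA9, so the result ends in an incomplete sequence; copying five bytes keeps the character whole.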
To avoid these problems, later versions of the ISO C standard
introduced a second set of functions that operate on wide
characters (see Extended Char Intro). These functions don't have
the problems the single-byte versions have, since every wide character
is a legal, interpretable value. This does not mean that cutting wide
character strings at arbitrary points is without problems. It normally
is for alphabet-based languages (except for non-normalized text), but
languages based on syllables still have the problem that more than one
wide character is necessary to complete a logical unit. This is a
higher-level problem which the C library functions are not designed
to solve. But it is at least good that no invalid byte sequences can be
created. Also, higher-level functions can operate much more easily
on wide characters than on multibyte characters, so the general advice
is to use wide characters internally whenever text is more than simply
copied.
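
For comparison, here is a sketch of the same kind of truncation done on a wide character string. The names cafe_wide and copy_wide are hypothetical; the example assumes an implementation in which wchar_t can hold the value 0xE9 for é as a single element.

```c
#include <wchar.h>

/* "café" as a wide string: four elements, one per character
   (assuming é is stored as the single value 0xE9).  */
static const wchar_t cafe_wide[] = L"caf\xe9";

/* Hypothetical helper: copy at most COUNT wide characters of SRC
   into DST, which must have room for at least COUNT + 1 elements,
   and always terminate the result.  */
void
copy_wide (wchar_t *dst, const wchar_t *src, size_t count)
{
  wcsncpy (dst, src, count);
  dst[count] = L'\0';   /* wcsncpy adds no null if SRC is too long */
}
```

Here the count is given in wide characters, not bytes, so cutting after three elements yields L"caf" and cutting after four yields the whole word. No element is ever split, although, as noted above, a logical unit made of several wide characters can still be.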
The remainder of this chapter discusses the functions for handling
wide character strings in parallel with the discussion of the multibyte
character strings, since there is almost always an exact equivalent
available.
Published under the terms of the GNU General Public License