MySQL 5.1 supports two character sets for storing Unicode data:
ucs2, the UCS-2 Unicode character set.
utf8, the UTF-8 encoding of the Unicode character set.
In UCS-2 (binary Unicode representation), every character is
represented by a two-byte Unicode code with the most significant
byte first. For example:
LATIN CAPITAL LETTER A has the code 0x0041 and is stored as a
two-byte sequence: 0x00 0x41.
CYRILLIC SMALL LETTER YERU (Unicode 0x044B) is stored as a
two-byte sequence: 0x04 0x4B.
For Unicode characters and their codes, please refer to the
Unicode Home Page.
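The byte layout described above can be checked outside MySQL. The following is a minimal Python sketch, not MySQL code: it simply writes a code point as two bytes with the most significant byte first. The helper name ucs2_bytes is hypothetical and exists only for this illustration.

```python
def ucs2_bytes(ch):
    """Return the two-byte, most-significant-byte-first form of one character."""
    code = ord(ch)
    if code > 0xFFFF:
        # A two-byte code cannot represent code points above 0xFFFF.
        raise ValueError("code point does not fit in two bytes")
    return code.to_bytes(2, "big")


def hex_pairs(data):
    """Format bytes the way the text writes them, e.g. '0x04 0x4B'."""
    return " ".join("0x%02X" % b for b in data)


print(hex_pairs(ucs2_bytes("A")))        # LATIN CAPITAL LETTER A
print(hex_pairs(ucs2_bytes("\u044b")))   # CYRILLIC SMALL LETTER YERU
```

The two printed lines should match the sequences given in the text, 0x00 0x41 and 0x04 0x4B.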
Currently, UCS-2 cannot be used as a client character set, which
means that SET NAMES 'ucs2' does not work.
The UTF-8 character set (transform Unicode representation) is an
alternative way to store Unicode data. It is implemented according
to RFC 3629. The idea of the UTF-8 character set is that various
Unicode characters are encoded using byte sequences of different
lengths:
Basic Latin letters, digits, and punctuation signs use one
byte.
Most European and Middle East script letters fit into a
two-byte sequence: extended Latin letters (with tilde, macron,
acute, grave and other accents), Cyrillic, Greek, Armenian,
Hebrew, Arabic, Syriac, and others.
Korean, Chinese, and Japanese ideographs use three-byte
sequences.
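The length classes listed above can be observed with any standard UTF-8 encoder. The sketch below uses Python's built-in utf-8 codec rather than MySQL itself, and the sample characters are chosen here only for illustration; utf8_length is a hypothetical helper name.

```python
def utf8_length(ch):
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# Sample characters, one or more per class described in the text.
samples = [
    ("A", "Basic Latin letter"),
    ("7", "digit"),
    ("\u00f1", "Latin small letter n with tilde"),
    ("\u044b", "Cyrillic small letter yeru"),
    ("\u05d0", "Hebrew letter alef"),
    ("\u4e2d", "CJK ideograph"),
]

for ch, description in samples:
    print("U+%04X  %d byte(s)  %s" % (ord(ch), utf8_length(ch), description))
```

The first two samples should report one byte, the next three two bytes, and the ideograph three bytes.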
RFC 3629 describes encoding sequences that take from one to four
bytes. Currently, MySQL support for UTF-8 does not include
four-byte sequences. (An older standard for UTF-8 encoding is
given by RFC 2279, which describes UTF-8 sequences that take from
one to six bytes. RFC 3629 renders RFC 2279 obsolete; for this
reason, sequences with five and six bytes are no longer used.)
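To see which text would fall outside the three-byte limit described above, one can check the longest UTF-8 sequence a string needs. This is a minimal Python sketch under that assumption, not a MySQL facility; the function names are hypothetical, and how the server treats such input is not shown here.

```python
def max_utf8_sequence_length(text):
    """Longest UTF-8 byte sequence needed by any single character in text."""
    return max((len(ch.encode("utf-8")) for ch in text), default=0)


def fits_three_byte_utf8(text):
    """True if no character in text needs a four-byte UTF-8 sequence."""
    return max_utf8_sequence_length(text) <= 3


# Characters up to U+FFFF need at most three bytes.
print(fits_three_byte_utf8("A\u044b\u4e2d"))

# U+1D11E MUSICAL SYMBOL G CLEF lies above U+FFFF and needs four bytes.
print(fits_three_byte_utf8("\U0001D11E"))
```

Running a check like this before inserting data is one way to find characters that need a four-byte sequence, which the text says MySQL's UTF-8 support does not currently include.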
Tip: To save space with UTF-8, use
VARCHAR instead of CHAR.
Otherwise, MySQL must reserve three bytes for each character in a
CHAR CHARACTER SET utf8 column because that is
the maximum possible length. For example, MySQL must reserve 30
bytes for a
CHAR(10) CHARACTER SET utf8 column.
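The arithmetic behind this tip can be sketched in a few lines of Python. The constant and function names are hypothetical. The variable-length figure counts only the encoded bytes of the value and deliberately ignores any per-value length overhead, so it is an approximation rather than an exact storage figure.

```python
# Per the text: at most three bytes per character in MySQL 5.1 utf8.
MAX_BYTES_PER_UTF8_CHAR = 3


def char_reserved_bytes(declared_length):
    """Bytes reserved for a CHAR(declared_length) CHARACTER SET utf8 column."""
    return declared_length * MAX_BYTES_PER_UTF8_CHAR


def encoded_bytes(value):
    """Bytes the value itself occupies in UTF-8 (length overhead not counted)."""
    return len(value.encode("utf-8"))


print(char_reserved_bytes(10))   # the CHAR(10) case from the text
print(encoded_bytes("hello"))    # five Basic Latin letters, one byte each
```

For mostly Latin text, the gap between the two numbers is the space a variable-length column avoids reserving.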