10.9.1. Unicode Character Sets
MySQL has two Unicode character sets, ucs2 and utf8. You can
store text in about 650 languages using these character sets.
The ucs2_hungarian_ci and
utf8_hungarian_ci collations were added in
MySQL 5.1.5.
MySQL implements the utf8_unicode_ci
collation according to the Unicode Collation Algorithm (UCA)
described at
https://www.unicode.org/reports/tr10/. The
collation uses the version-4.0.0 UCA weight keys:
https://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
The following discussion uses
utf8_unicode_ci, but it applies equally to
ucs2_unicode_ci.
Currently, the utf8_unicode_ci collation has
only partial support for the Unicode Collation Algorithm. Some
characters are not supported yet. Also, combining marks are not
fully supported. This affects primarily Vietnamese and some
minority languages in Russia such as Udmurt, Tatar, Bashkir, and
Mari.
The most significant feature in
utf8_unicode_ci is that it supports
expansions; that is, cases where one character compares as equal
to a combination of other characters. For example, in German and
some other languages ‘ß’ is
equal to ‘ss’.
utf8_general_ci is a legacy collation that
does not support expansions. It can make only one-to-one
comparisons between characters. This means that comparisons for
the utf8_general_ci collation are faster, but
slightly less correct, than comparisons for
utf8_unicode_ci.
For example, the following equalities hold in both
utf8_general_ci and
utf8_unicode_ci:
Ä = A
Ö = O
Ü = U
A difference between the collations is that this is true for
utf8_general_ci:
ß = s
Whereas this is true for utf8_unicode_ci:
ß = ss
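The difference between one-to-one comparison and comparison with
expansions can be sketched with a toy model. The Python code below
is only an illustration: the tables GENERAL_WEIGHTS and
UNICODE_WEIGHTS and the functions sort_key and equal_under are
hypothetical names invented for this sketch, they cover only the
characters discussed above, and they are not MySQL's weight data
or implementation.

```python
# Toy model only: these tables are illustrative assumptions,
# not the weight tables that MySQL actually uses.

# In the style of utf8_general_ci: every character maps to exactly one weight
# (represented here by a single base letter).
GENERAL_WEIGHTS = {"Ä": "A", "Ö": "O", "Ü": "U", "ß": "S"}

# In the style of utf8_unicode_ci: a character may expand to several weights.
UNICODE_WEIGHTS = {"Ä": "A", "Ö": "O", "Ü": "U", "ß": "SS"}


def sort_key(text, weights):
    """Replace each character with its weight(s), ignoring letter case."""
    return "".join(
        weights.get(ch, weights.get(ch.upper(), ch.upper())) for ch in text
    )


def equal_under(a, b, weights):
    """Two strings compare as equal when their keys are identical."""
    return sort_key(a, weights) == sort_key(b, weights)


print(equal_under("Ä", "A", GENERAL_WEIGHTS))   # True
print(equal_under("ß", "s", GENERAL_WEIGHTS))   # True
print(equal_under("ß", "ss", GENERAL_WEIGHTS))  # False
print(equal_under("ß", "ss", UNICODE_WEIGHTS))  # True
```

In this model the one-to-one table cannot make ‘ß’ equal to
‘ss’, because a single character always yields a single weight,
which is the limitation described for utf8_general_ci.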
MySQL implements language-specific collations for the
utf8 character set only if the ordering with
utf8_unicode_ci does not work well for a
language. For example, utf8_unicode_ci works
fine for German and French, so there is no need to create
special utf8 collations for these two
languages.
utf8_general_ci also is satisfactory for both
German and French, except that
‘ß’ is equal to
‘s’, and not to
‘ss’. If this is acceptable for
your application, then you should use
utf8_general_ci because it is faster.
Otherwise, use utf8_unicode_ci because it is
more accurate.
utf8_swedish_ci, like other
utf8 language-specific collations, is derived
from utf8_unicode_ci with additional language
rules. For example, in Swedish, the following relationship
holds, which is not something expected by a German or French
speaker:
Ü = Y < Ö
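As a rough illustration of how a language rule changes the
ordering, the following Python sketch models only the Swedish
relationship shown above. The function swedish_key and its weights
are hypothetical; placing ‘Ö’ after all basic letters is an
assumption of this sketch (the text above states only that it
follows ‘Y’), and nothing here reflects how MySQL stores its
collation data.

```python
# Toy model only: hypothetical weights, not MySQL's collation tables.

def swedish_key(text):
    """Build a sort key that applies the rule U-umlaut = Y < O-umlaut."""
    key = []
    for ch in text.upper():
        if ch == "Ü":
            ch = "Y"  # Ü receives the same weight as Y
        if ch == "Ö":
            # Assumption of this sketch: Ö sorts after every basic letter.
            key.append((1, 0))
        else:
            key.append((0, ord(ch)))
    return key


print(swedish_key("Ü") == swedish_key("Y"))        # True
print(swedish_key("Y") < swedish_key("Ö"))         # True
print(sorted(["Ö", "Ü", "X"], key=swedish_key))    # ['X', 'Ü', 'Ö']
```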
The utf8_spanish_ci and
utf8_spanish2_ci collations correspond to
modern Spanish and traditional Spanish, respectively. In both
collations, ‘ñ’ (n-tilde) is a
separate letter between ‘n’ and
‘o’. In addition, for traditional
Spanish, ‘ch’ is a separate
letter between ‘c’ and
‘d’, and ‘ll’ is
a separate letter between ‘l’ and
‘m’.
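The effect of the two Spanish orderings on sorted output can be
sketched as follows. This Python code is a simplified model
written for this example: the function spanish_key and its
traditional flag are hypothetical names, the weights are arbitrary,
and the code does not reproduce MySQL's implementation of
utf8_spanish_ci or utf8_spanish2_ci.

```python
# Toy model only: hypothetical weights, not MySQL's collation tables.

def spanish_key(word, traditional=False):
    """Return a list of weights; digraphs count as letters if traditional."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if traditional and pair in ("ch", "ll"):
            # The digraph is one letter that sorts right after its first character.
            key.append((ord(pair[0]), 1))
            i += 2
        elif word[i] == "ñ":
            # n-tilde is a separate letter between n and o in both orderings.
            key.append((ord("n"), 1))
            i += 1
        else:
            key.append((ord(word[i]), 0))
            i += 1
    return key


words = ["dama", "cuna", "chico"]
print(sorted(words, key=spanish_key))
# ['chico', 'cuna', 'dama']
print(sorted(words, key=lambda w: spanish_key(w, traditional=True)))
# ['cuna', 'chico', 'dama']
```

In the modern ordering ‘chico’ sorts before ‘cuna’ because ‘h’
precedes ‘u’; in the traditional ordering ‘ch’ is a single letter
that follows every word beginning with a plain ‘c’.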