Follow Techotopia on Twitter

On-line Guides
All Guides
eBook Store
iOS / Android
Linux for Beginners
Office Productivity
Linux Installation
Linux Security
Linux Utilities
Linux Virtualization
Linux Kernel
System/Network Admin
Programming
Scripting Languages
Development Tools
Web Development
GUI Toolkits/Desktop
Databases
Mail Systems
openSolaris
Eclipse Documentation
Techotopia.com
Virtuatopia.com

How To Guides
Virtualization
General System Admin
Linux Security
Linux Filesystems
Web Servers
Graphics & Desktop
PC Hardware
Windows
Problem Solutions

  




 

 

10.9.1. Unicode Character Sets

MySQL has two Unicode character sets. You can store text in about 650 languages using these character sets.

  • ucs2 (UCS-2 Unicode) collations:

    • ucs2_bin

    • ucs2_czech_ci

    • ucs2_danish_ci

    • ucs2_esperanto_ci

    • ucs2_estonian_ci

    • ucs2_general_ci (default)

    • ucs2_hungarian_ci

    • ucs2_icelandic_ci

    • ucs2_latvian_ci

    • ucs2_lithuanian_ci

    • ucs2_persian_ci

    • ucs2_polish_ci

    • ucs2_roman_ci

    • ucs2_romanian_ci

    • ucs2_slovak_ci

    • ucs2_slovenian_ci

    • ucs2_spanish2_ci

    • ucs2_spanish_ci

    • ucs2_swedish_ci

    • ucs2_turkish_ci

    • ucs2_unicode_ci

  • utf8 (UTF-8 Unicode) collations:

    • utf8_bin

    • utf8_czech_ci

    • utf8_danish_ci

    • utf8_esperanto_ci

    • utf8_estonian_ci

    • utf8_general_ci (default)

    • utf8_hungarian_ci

    • utf8_icelandic_ci

    • utf8_latvian_ci

    • utf8_lithuanian_ci

    • utf8_persian_ci

    • utf8_polish_ci

    • utf8_roman_ci

    • utf8_romanian_ci

    • utf8_slovak_ci

    • utf8_slovenian_ci

    • utf8_spanish2_ci

    • utf8_spanish_ci

    • utf8_swedish_ci

    • utf8_turkish_ci

    • utf8_unicode_ci

The ucs2_hungarian_ci and utf8_hungarian_ci collations were added in MySQL 5.1.5.

MySQL implements the utf8_unicode_ci collation according to the Unicode Collation Algorithm (UCA) described at http://www.unicode.org/reports/tr10/. The collation uses the version-4.0.0 UCA weight keys: http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt. The following discussion uses utf8_unicode_ci, but it is also true for ucs2_unicode_ci.

Currently, the utf8_unicode_ci collation has only partial support for the Unicode Collation Algorithm. Some characters are not supported yet. Also, combining marks are not fully supported. This affects primarily Vietnamese and some minority languages in Russia such as Udmurt, Tatar, Bashkir, and Mari.

The most significant feature in utf8_unicode_ci is that it supports expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages ‘ß’ is equal to ‘ss’.

utf8_general_ci is a legacy collation that does not support expansions. It can make only one-to-one comparisons between characters. This means that comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci.

For example, the following equalities hold in both utf8_general_ci and utf8_unicode_ci:

Ä = A
Ö = O
Ü = U

A difference between the collations is that this is true for utf8_general_ci:

ß = s

Whereas this is true for utf8_unicode_ci:

ß = ss

MySQL implements language-specific collations for the utf8 character set only if the ordering with utf8_unicode_ci does not work well for a language. For example, utf8_unicode_ci works fine for German and French, so there is no need to create special utf8 collations for these two languages.

utf8_general_ci also is satisfactory for both German and French, except that ‘ß’ is equal to ‘s’, and not to ‘ss’. If this is acceptable for your application, then you should use utf8_general_ci because it is faster. Otherwise, use utf8_unicode_ci because it is more accurate.

utf8_swedish_ci, like other utf8 language-specific collations, is derived from utf8_unicode_ci with additional language rules. For example, in Swedish, the following relationship holds, which is not something expected by a German or French speaker:

Ü = Y < Ö

The utf8_spanish_ci and utf8_spanish2_ci collations correspond to modern Spanish and traditional Spanish, respectively. In both collations, ‘ñ’ (n-tilde) is a separate letter between ‘n’ and ‘o’. In addition, for traditional Spanish, ‘ch’ is a separate letter between ‘c’ and d, and ‘ll’ is a separate letter between ‘l’ and ‘m


 
 
  Published under the terms of the GNU General Public License Design by Interspire