Follow Techotopia on Twitter

On-line Guides
All Guides
eBook Store
iOS / Android
Linux for Beginners
Office Productivity
Linux Installation
Linux Security
Linux Utilities
Linux Virtualization
Linux Kernel
System/Network Admin
Programming
Scripting Languages
Development Tools
Web Development
GUI Toolkits/Desktop
Databases
Mail Systems
openSolaris
Eclipse Documentation
Techotopia.com
Virtuatopia.com
Answertopia.com

How To Guides
Virtualization
General System Admin
Linux Security
Linux Filesystems
Web Servers
Graphics & Desktop
PC Hardware
Windows
Problem Solutions
Privacy Policy

  




 

 

10.9.7.1. The cp932 Character Set

Why is cp932 needed?

In MySQL, the sjis character set corresponds to the Shift_JIS character set defined by IANA, which supports JIS X0201 and JIS X0208 characters. (See https://www.iana.org/assignments/character-sets.)

However, the meaning of “SHIFT JIS” as a descriptive term has become very vague and it often includes the extensions to Shift_JIS that are defined by various vendors.

For example, “SHIFT JIS” used in Japanese Windows environments is a Microsoft extension of Shift_JIS and its exact name is Microsoft Windows Codepage : 932 or cp932. In addition to the characters supported by Shift_JIS, cp932 supports extension characters such as NEC special characters, NEC selected — IBM extended characters, and IBM extended characters.

Many Japanese users have experienced problems using these extension characters. These problems stem from the following factors:

  • MySQL automatically converts character sets.

  • Character sets are converted via Unicode (ucs2).

  • The sjis character set does not support the conversion of these extension characters.

  • There are several conversion rules from so-called “SHIFT JIS” to Unicode, and some characters are converted to Unicode differently depending on the conversion rule. MySQL supports only one of these rules (described later).

The MySQL cp932 character set is designed to solve these problems.

Because MySQL supports character set conversion, it is important to separate IANA Shift_JIS and cp932 into two different character sets because they provide different conversion rules.

How does cp932 differ from sjis?

The cp932 character set differs from sjis in the following ways:

  • cp932 supports NEC special characters, NEC selected — IBM extended characters, and IBM selected characters.

  • Some cp932 characters have two different code points, both of which convert to the same Unicode code point. When converting from Unicode back to cp932, one of the code points must be selected. For this “round trip conversion,” the rule recommended by Microsoft is used. (See https://support.microsoft.com/kb/170559/EN-US/.)

    The conversion rule works like this:

    • If the character is in both JIS X 0208 and NEC special characters, use the code point of JIS X 0208.

    • If the character is in both NEC special characters and IBM selected characters, use the code point of NEC special characters.

    • If the character is in both IBM selected characters and NEC selected — IBM extended characters, use the code point of IBM extended characters.

    The table shown at https://www.microsoft.com/globaldev/reference/dbcs/932.htm provides information about the Unicode values of cp932 characters. For cp932 table entries with characters under which a four-digit number appears, the number represents the corresponding Unicode (ucs2) encoding. For table entries with an underlined two-digit value appears, there is a range of cp932 character values that begin with those two digits. Clicking such a table entry takes you to a page that displays the Unicode value for each of the cp932 characters that begin with those digits.

    The following links are of special interest. They correspond to the encodings for the following sets of characters:

  • cp932 supports conversion of user-defined characters in combination with eucjpms, and solves the problems with sjis/ujis conversion. For details, please refer to https://www.opengroup.or.jp/jvc/cde/sjis-euc-e.html.

  • For some characters, conversion to and from ucs2 is different for sjis and cp932. The following tables illustrate these differences.

    Conversion to ucs2:

    sjis/cp932 Value sjis -> ucs2 Conversion cp932 -> ucs2 Conversion
    5C 005C 005C
    7E 007E 007E
    815C 2015 2015
    815F 005C FF3C
    8160 301C FF5E
    8161 2016 2225
    817C 2212 FF0D
    8191 00A2 FFE0
    8192 00A3 FFE1
    81CA 00AC FFE2

    Conversion from ucs2:

    ucs2 value ucs2 -> sjis Conversion ucs2 -> cp932 Conversion
    005C 815F 5C
    007E 7E 7E
    00A2 8191 3F
    00A3 8192 3F
    00AC 81CA 3F
    2015 815C 815C
    2016 8161 3F
    2212 817C 3F
    2225 3F 8161
    301C 8160 3F
    FF0D 3F 817C
    FF3C 3F 815F
    FF5E 3F 8160
    FFE0 3F 8191
    FFE1 3F 8192
    FFE2 3F 81CA

 
 
  Published under the terms of the GNU General Public License Design by Interspire