Information Storage and Retrieval: January 2011

Tuesday, January 18, 2011

Character Sets and Code Pages

In computer science, the terms character encoding, character map, character set, and code page were historically synonymous.

A character set is an agreement on which numeric value each symbol has. A computer does not understand 'A' or 'B'; it stores only the numeric (binary) value of each symbol, as defined by the character set its operating system uses. Because a computer only 'understands' numbers, character sets are needed.
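As a small illustration of this symbol-to-number agreement, the following Python sketch looks up the numeric value of 'A' and 'B' and shows it in binary. The variable name `codes` is only chosen for this example.

```python
# Look up the number that stands for each symbol.
codes = {symbol: ord(symbol) for symbol in "AB"}

for symbol, code in codes.items():
    # format(code, "08b") renders the number as 8 binary digits.
    print(symbol, code, format(code, "08b"))

# The mapping works in both directions: a number back to its symbol.
print(chr(65))  # prints A
```

The loop prints `A 65 01000001` and `B 66 01000010`, which are the values both ASCII and Unicode assign to these letters.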

ASCII is a 7-bit character set, so it defines only 128 (2^7) symbols. UTF-8 is a multibyte encoding of the Unicode character set (UTF8/AL32UTF8 in Oracle).
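A rough way to see the difference is to encode a few characters in Python. This is only a sketch: the sample characters and the helper name `utf8_length` are picked for illustration.

```python
def utf8_length(symbol: str) -> int:
    """Return how many bytes UTF-8 needs for one character."""
    return len(symbol.encode("utf-8"))

# Characters inside the 128-symbol ASCII range take one byte in UTF-8;
# characters outside it take two or more bytes.
for symbol in ["A", "é", "€"]:
    print(symbol, utf8_length(symbol))

# ASCII itself has no number for symbols beyond its 128 entries.
try:
    "€".encode("ascii")
    ascii_can_encode_euro = True
except UnicodeEncodeError:
    ascii_can_encode_euro = False

print(ascii_can_encode_euro)  # prints False
```

Here 'A' needs 1 byte, 'é' needs 2, and '€' needs 3, while the attempt to encode '€' as ASCII fails.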

Code page is another name for character encoding. It consists of a table of values that describes the character set for a particular language. Vendors often allocate their own code page number to a character encoding, even if it is better known by another name (for example, the UTF-8 character encoding has code page number 1208 at IBM, 65001 at Microsoft, and 4110 at SAP).
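To make the "table of values" idea concrete, the sketch below decodes the same single byte with three code pages that ship with Python's standard codecs (`cp1252`, `cp437`, `cp1251`). The byte value 0xE9 and the variable names are arbitrary choices for this example.

```python
raw = bytes([0xE9])  # one byte with the numeric value 233

# Each code page is a different lookup table from byte value to symbol.
decoded = {page: raw.decode(page) for page in ("cp1252", "cp437", "cp1251")}

for page, symbol in decoded.items():
    print(page, symbol)
```

Each code page gives a different character for the same number ('é' under `cp1252`, for instance), which is why text must be read with the same code page it was written in.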