Introduction to character sets and collations

When you create a database, you specify a collating sequence, or collation, to be used by the database. A collation combines a character set with a sorting order for the characters in that set. Whenever the database compares strings, sorts strings, or carries out other string operations such as case conversion, it does so using the collating sequence. The database carries out sorting and string comparison when statements such as the following are submitted:

Queries with an ORDER BY clause.
Expressions that use string functions, such as LOCATE, SIMILAR, and SOUNDEX.
Conditions using the LIKE keyword.

The database also uses character sets in identifiers (column names and so on). In deciding whether a string is a valid or unique identifier, the database uses the database collation.

Single-byte character sets and code pages

Many languages have few enough characters to be represented in a single-byte character set. In such a character set, each character is represented by a single byte, which is usually written as a two-digit hexadecimal number.

At most, 256 characters can be represented in a single byte. No single-byte character set can hold all of the characters used internationally, including accented characters. IBM solved this problem by developing a set of code pages, each of which describes a set of characters appropriate for one or more national languages. For example, code page 869 contains the Greek character set, and code page 850 contains an international character set suitable for representing many characters in a variety of languages.

Adaptive Server Anywhere supplies a set of single-byte collations (code pages and collation orderings) suitable for many languages of European origin.

Upper and lower pages

With few exceptions, characters 0 to 127 are the same for all the single-byte code pages. The mapping for this range of characters is called the ASCII character set. It includes the English language alphabet in upper and lower case, as well as common punctuation symbols and the digits. This range is often called the seven-bit range (because only seven bits are needed) or the lower page. The characters from 128 to 255 are called extended characters, or upper code-page characters, and vary from code page to code page.

There is generally no problem with code page compatibility if the only characters used are from the English alphabet, as these are represented in the ASCII portion of each code page (0 to 127). However, if other characters are used, as is generally the case in any non-English environment, there can be problems if the database and the application use different code pages.
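The split between the lower and upper pages can be checked with a short Python sketch. It relies on Python's built-in cp437, cp850, and cp1252 codecs, which are an assumption of this illustration and not part of Adaptive Server Anywhere. Byte values that a code page leaves unassigned are decoded with a replacement character.

```python
# Compare how three single-byte code pages decode the same byte values.
# The codec names are Python's; they stand in for the code pages in the text.
CODE_PAGES = ["cp437", "cp850", "cp1252"]


def decode_range(code_page, start, stop):
    """Decode each byte value in [start, stop) on its own.

    Unassigned byte values become the Unicode replacement character.
    """
    return [bytes([b]).decode(code_page, errors="replace")
            for b in range(start, stop)]


lower = {cp: decode_range(cp, 0, 128) for cp in CODE_PAGES}
upper = {cp: decode_range(cp, 128, 256) for cp in CODE_PAGES}

# True when every code page agrees with cp437 over the whole range.
lower_matches = all(lower[cp] == lower["cp437"] for cp in CODE_PAGES)
upper_matches = all(upper[cp] == upper["cp437"] for cp in CODE_PAGES)

print("lower page identical:", lower_matches)
print("upper page identical:", upper_matches)
```

In these three codecs the lower page is identical and the upper page is not, which is why text restricted to the English alphabet survives a code page mismatch.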

Example

Suppose a database holding French language strings uses code page 850, and the client operating system uses code page 437. The character À (upper case A grave) is held in the database as character 183 (hexadecimal B7). In code page 437, character 183 is a graphical character. When the client application receives this byte and the operating system displays it on the screen, the user sees a graphical character instead of an A grave.
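The mismatch in this example can be reproduced in a few lines of Python. This is a minimal sketch that uses Python's cp850 and cp437 codecs to stand in for the database and the client. The variable names are illustrative only.

```python
# The database stores the character using code page 850.
stored = "À".encode("cp850")
stored_value = stored[0]            # numeric value of the single stored byte

# The client interprets the same byte using code page 437.
seen_by_client = stored.decode("cp437")

print(stored_value, hex(stored_value))   # byte value as held in the database
print(seen_by_client)                    # a box-drawing character, not À
```

The byte itself never changes. Only the code page used to interpret it differs between the two sides.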

Code pages in Windows and Windows NT

For PC users, the issue is complicated because there are at least two code pages in use on most PCs. MS-DOS, as well as character-mode applications (those using the console or "DOS box") in Windows 95 and Windows NT, use code pages taken from the IBM set. These are called OEM code pages (Original Equipment Manufacturer) for historical reasons.

Windows operating systems do not require the line drawing characters that were held in the extended characters of the OEM code pages, so they use a different set of code pages. These pages were based on the ANSI standard and are therefore commonly called ANSI code pages.

Adaptive Server Anywhere supports collations based on both OEM and ANSI code pages.

Example

Consider the following situation:

A PC is running the Windows 95 operating system with ANSI code page 1252.
The code page for character-mode applications is OEM code page 437.
Text is held in a database created using the collation corresponding to OEM code page 850.

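An upper case A grave in the database is stored as character 183. This value is displayed as a graphical character in a character-mode application, because that is what character 183 represents in code page 437. The same character is displayed as a dot in a Windows application, because character 183 is a centered dot in code page 1252.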
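A display problem of this kind goes away only if the stored byte is translated from the database code page to the code page of the application. The following Python sketch shows what such a translation involves. The transcode helper is hypothetical and Python's codecs stand in for the three code pages. It does not describe how Adaptive Server Anywhere performs any conversion.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Reinterpret bytes from one code page in another (hypothetical helper)."""
    return data.decode(source).encode(target)


stored = bytes([183])  # upper case A grave as held in the code page 850 database

# Without translation, each application misreads the byte in its own way.
misread = {cp: stored.decode(cp) for cp in ("cp437", "cp1252")}

# Translated to the ANSI code page, the character maps to a different byte.
ansi_bytes = transcode(stored, "cp850", "cp1252")

# Code page 437 has no upper case A grave, so no translation exists for it.
try:
    transcode(stored, "cp850", "cp437")
    oem_has_character = True
except UnicodeEncodeError:
    oem_has_character = False

print(misread, ansi_bytes, oem_has_character)
```

The last step illustrates a limit of translation: a character can only be carried across if the target code page contains it at all.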

For information about choosing a single-byte collation for your database, see Choosing a database collation.

Multibyte character sets

Some languages, such as Japanese and Chinese, have many more than 256 characters. These characters cannot all be represented using a single byte, but can be represented in multibyte character sets. In addition, some character sets use the much larger number of characters available in a multibyte representation to represent characters from many languages in a single, more comprehensive, character set.

Multibyte character sets are of two types. Some are variable width, in which some characters are single-byte characters, others are double-byte, and so on. Other sets are fixed width, in which all characters in the set have the same number of bytes. Adaptive Server Anywhere supports only variable-width character sets.
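The difference between variable-width and fixed-width sets can be seen by counting the bytes each character occupies. The sketch below uses Python's shift_jis, utf-8, and utf-32-be codecs purely as familiar examples of each type. They are assumptions of this illustration and are not a list of the character sets the product supplies.

```python
def byte_widths(text, encoding):
    """Return (character, number of bytes) pairs for text in the given encoding."""
    return [(ch, len(ch.encode(encoding))) for ch in text]


sample = "A日"  # one ASCII letter and one Japanese character

variable_sjis = byte_widths(sample, "shift_jis")   # variable width
variable_utf8 = byte_widths(sample, "utf-8")       # variable width
fixed_utf32 = byte_widths(sample, "utf-32-be")     # fixed width

for name, widths in [("shift_jis", variable_sjis),
                     ("utf-8", variable_utf8),
                     ("utf-32-be", fixed_utf32)]:
    print(name, [width for _, width in widths])
```

In the variable-width encodings the ASCII letter takes a single byte and the Japanese character takes more. In the fixed-width encoding both take the same number of bytes.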

For information on the multibyte character sets, see Using multibyte collations.

Displaying your current character settings

Each operating system has its own system for handling character sets, encodings, and collation sequences. To find the current settings on your operating system, do either of the following:

At a system command prompt, type chcp to display the current code page. On Windows and Windows NT PCs, this returns the OEM code page.
In Windows 3.x, Windows 95, and Windows NT, see the Regional Settings in the Control Panel. The Regional Settings correspond to an ANSI code page.

Sorting characters using collations

A collation is a sorting order for characters in a character set encoding or code page. The collation sequence is based on the encoded value of the characters.

The collation sequence includes the notion of alphabetic ordering of letters, and extends it to include all characters in the character set, including digits and space characters.

Associating more than one character with each sort position

More than one character can be associated with each sort position. This is useful if you wish, for example, to treat an accented character the same as the character without an accent.

Two characters with the same sort position are considered identical in all ways by the database. Therefore, if a collation assigned the characters a and e to the same sort position, then a query with the following search condition:

WHERE col1 = 'want'

is satisfied by a row for which col1 contains the entry "went".
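The effect of shared sort positions can be modeled with a small Python sketch. The SORT_POSITION table and both functions are hypothetical. They imitate the behavior described above for a made-up collation covering five letters, and say nothing about how the database implements comparisons.

```python
# Hypothetical collation: each character maps to a sort position.
# The characters a and e deliberately share position 1.
SORT_POSITION = {"a": 1, "e": 1, "n": 2, "t": 3, "w": 4}


def sort_key(text):
    """Translate a string into its sequence of sort positions."""
    return tuple(SORT_POSITION[ch] for ch in text)


def collation_equal(left, right):
    """Strings are equal when their sort-position sequences are equal."""
    return sort_key(left) == sort_key(right)


rows = ["want", "went", "wet"]

# Rows that would satisfy a comparison against 'want' under this collation.
matches = [value for value in rows if collation_equal(value, "want")]
print(matches)
```

Both "want" and "went" translate to the same sequence of sort positions, so the comparison cannot tell them apart, while "wet" is still different.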

At each sort position, lower- and uppercase forms of a character can be indicated. For case-sensitive databases, the lower- and uppercase characters are not treated as equivalent. For case-insensitive databases, the lower- and uppercase versions of the character are considered equivalent.
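Case sensitivity can be modeled the same way. In the sketch below, each sort position lists a lowercase and an uppercase form, and a flag decides whether the two forms compare as equal. The POSITIONS table and the make_key function are hypothetical names for this illustration only.

```python
# Hypothetical collation: each sort position lists (lowercase, uppercase) forms.
POSITIONS = [("a", "A"), ("b", "B"), ("c", "C")]


def make_key(case_sensitive):
    """Build a key function for a case-sensitive or case-insensitive database."""
    table = {}
    for position, (lower, upper) in enumerate(POSITIONS):
        table[lower] = (position, 0)
        # The uppercase form gets a distinct marker only when case matters.
        table[upper] = (position, 1 if case_sensitive else 0)

    def key(text):
        return tuple(table[ch] for ch in text)

    return key


sensitive = make_key(case_sensitive=True)
insensitive = make_key(case_sensitive=False)

print(sensitive("Cab") == sensitive("cab"))      # forms are kept distinct
print(insensitive("Cab") == insensitive("cab"))  # forms are equivalent
```

Keeping both forms at one sort position means upper and lower case letters still sort next to each other, whichever setting is used.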