Creating databases with custom collations

You can create a database using a collation different from the supplied collations. This section describes how to build databases using such a custom collation.

Steps to create a database

To create a database with a custom collation:

Decide on a starting collation. Choose a collation as close as possible to the one you want to create, and use it as the starting point for your custom collation.

For a listing of supplied collations, see Supplied collations. Alternatively, run dbinit with the -l (lowercase L) option:
```
dbinit -l
```
Create a custom collation file. Use the Collation utility (dbcollat) to do this. The output is a collation file.

For example, the following command extracts the collation from the default database into a file named mycol.col:
```
dbcollat -c "uid=dba;pwd=sql" mycol.col
```
When you use the Collation utility to extract a collation from an existing database, the database does not need to contain any data. If you do not have a database that uses the collation on which you want to base your custom collation, you can create one using the Initialization utility.

Edit the custom collation file. Using any text editor, make the changes you want and provide a name for your collation.

The name of the collation is specified on a line near the top of the file, starting with the keyword Collation. Edit this line to provide a new name.

Convert the file to a SQL script. Run the dbcollat command-line utility with the -d switch. This step was not required in previous releases.

For example, the following command line creates the mycol.sql file from the mycol.col collation file:
```
dbcollat -d mycol.col custmap.sql mycol.sql
```
Add the custom collation to the custom.sql script. The custom.sql script is stored in the scripts subdirectory of your installation directory. This step was not required in previous releases.

Create the new database. Use the Initialization utility (dbinit), specifying the name of the custom collation. In previous releases, you would have specified the collation file name.

For example, the following command line creates a database named temp.db using the custom collation sequence newcol:
```
dbinit -z newcol temp.db
```

First-byte collation orderings

A sorting order for characters in a multibyte character set can be specified only for the first byte. Characters that have the same first byte are sorted according to the hexadecimal value of the following bytes. If the two characters are the same up to the length of the shorter of the two, the longer character is greater than the shorter.
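
The rule above can be illustrated with a small sort key. This is only a sketch of the described ordering, not the database's implementation: the `positions` table is a hypothetical collation, and the fallback to the raw byte value for unlisted first bytes is an assumption made for the example.

```python
def sort_key(char_bytes, first_byte_position):
    """Sort key for one (possibly multibyte) character.

    first_byte_position maps a first-byte value to its collation position.
    Unlisted first bytes fall back to their binary value (an assumption).
    """
    first = char_bytes[0]
    position = first_byte_position.get(first, first)
    # Following bytes compare by raw value. Python's bytes comparison
    # already ranks a shorter prefix below the longer sequence.
    return (position, char_bytes[1:])


# Hypothetical collation in which first byte 0xFC sorts before 0xFB.
positions = {0xFC: 0, 0xFB: 1}

chars = [b"\xfb\x01", b"\xfc\x02", b"\xfc\x01", b"\xfc"]
ordered = sorted(chars, key=lambda c: sort_key(c, positions))
print(ordered)
```

Only the first byte consults the collation table; everything after it is compared as plain bytes, which is why custom ordering within a group of characters sharing a first byte is not possible.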

Editing the collation file

This section describes the collation file format. Collation files may include the following elements:

Comment lines, which are ignored by the database.
A title line.
A collation sequence section.
An Encodings section (multibyte character sets only).
A Properties section (multibyte character sets only).

Comment lines

In the collation file, spaces are generally ignored. Comment lines start with either % or --.

The title line

The first non-comment line must be of the form:

Collation label (name)

In this statement:

Item	Description
Collation	A required keyword.
label	The collation label, which appears in the system tables as SYS.SYSCOLLATION.collation_label and SYS.SYSINFO.default_collation. The label must contain no more than 10 characters, and must not be the same as one of the built-in collations. (In particular, do not leave the collation label unchanged.)
name	A descriptive term, used for documentation purposes. The name should contain no more than 128 characters.

For example, the Shift-JIS collation file contains the following collation line, with label SJIS and name (Japanese Shift-JIS Encoding):

Collation SJIS (Japanese Shift-JIS Encoding)
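
As an illustration of the title line rules, the following sketch parses a title line and checks the length limits stated above. The function name `parse_title_line` and the `builtin_labels` parameter are hypothetical, the comparison against built-in labels is assumed to be exact-match, and the 128-character guideline for the name is treated as a hard limit only for the purposes of the example.

```python
import re

# "Collation", then a label, then a descriptive name in parentheses.
TITLE_RE = re.compile(r"^\s*Collation\s+([^\s(]+)\s*\((.*)\)\s*$")


def parse_title_line(line, builtin_labels=frozenset()):
    """Return (label, name) from a collation title line, or raise ValueError."""
    match = TITLE_RE.match(line)
    if match is None:
        raise ValueError("not a collation title line: %r" % line)
    label, name = match.group(1), match.group(2)
    if len(label) > 10:
        raise ValueError("label longer than 10 characters: %r" % label)
    if label in builtin_labels:
        raise ValueError("label matches a built-in collation: %r" % label)
    if len(name) > 128:
        raise ValueError("name longer than 128 characters")
    return label, name


print(parse_title_line("Collation SJIS (Japanese Shift-JIS Encoding)"))
```

The returned name excludes the surrounding parentheses. Passing the labels of the supplied collations as `builtin_labels` would catch a label that was left unchanged after extraction.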

The collation sequence section

After the title line, each non-comment line describes one position in the collation. The ordering of the lines determines the sort ordering used by the database, and determines the result of comparisons. Characters on lines appearing higher in the file (closer to the beginning) sort before characters that appear later.

The form of each line in the sequence is:

[sort-position] : character

[sort-position] : character [lowercase uppercase]

where:

Argument	Description
sort-position	Optional. Specifies the position at which the characters on that line will sort. Smaller numbers represent a lesser value, and so sort closer to the beginning of the sorted list. Typically, the sort-position is omitted, and the characters sort immediately following the characters from the previous sort position.
character	The character whose sort-position is being specified.
lowercase	Optional. Specifies the lowercase equivalent of the character. If not specified, the character has no lowercase equivalent.
uppercase	Optional. Specifies the uppercase equivalent of the character. If not specified, the character has no uppercase equivalent.

Multiple characters may appear on one line, separated by commas (,). In this case, these characters are sorted and compared as if they were the same character. You need to specify all three forms (the character, its lowercase equivalent, and its uppercase equivalent) of the first character, then a comma, then all three forms of the second character, and so on.
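
The effect of line order, and of placing several characters on one line, can be sketched as follows. This is an illustrative model for a single-byte character set only: it ignores explicit sort-position values and the lowercase and uppercase mappings, and the `sequence` list is a hand-written stand-in for the lines of a collation file.

```python
def build_positions(sequence):
    """Map each byte value to the index of the line it appears on."""
    positions = {}
    for index, line_chars in enumerate(sequence):
        for byte_value in line_chars:
            positions[byte_value] = index
    return positions


def collation_key(data, positions):
    """Sort key for a single-byte string: one position per byte."""
    return tuple(positions[byte_value] for byte_value in data)


# Each inner list stands for one line; characters sharing a line share a position.
sequence = [
    [0x41],              # A
    [0x61],              # a
    [0x42],              # B
    [0x62],              # b
    [0x65, 0x82, 0x8A],  # e and two accented forms (code page 850)
    [0x45, 0x90, 0xD4],  # E and two accented forms (code page 850)
]
positions = build_positions(sequence)

words = [b"b", b"e", b"B", b"a", b"A"]
print(sorted(words, key=lambda w: collation_key(w, positions)))
```

Because \x82 and e sit on the same line, `collation_key(b"\x82", positions)` equals `collation_key(b"e", positions)`, which is what "compared as if they were the same character" means in practice.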

Specifying character and sort-position

Each character and sort position is specified in one of the following ways:

Specification	Description
\dnnn	Decimal number, using digits 0-9 (such as \d001)
\xhh	Hexadecimal number, using digits 0-9 and letters a-f or A-F (such as \xB4)
'c'	Any character in place of c (such as ',')
c	Any character other than a quote ('), backslash (\), colon (:), or comma (,). Those four characters must be written using one of the previous forms.

The following are some sample lines for a collation:

% Sort some letters in alphabetical order
: A a A
: a a A
: B b B
: b b B
% Sort some E's from code page 850,
% including some accented extended characters:
: e e E, \x82 \x82 \x90, \x8A \x8A \xD4
: E e E, \x90 \x82 \x90, \xD4 \x8A \xD4
% Sort some special characters at the end:
: ' '
: _
: \xF2
: \xEE
: \xF0
: -
: ','
: ;
: ':'
: !
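
To make the syntax concrete, here is a hedged sketch of a parser for one non-comment line of the collation sequence. It is not the parser used by the database tools: the names `tokenize` and `parse_sequence_line` are hypothetical, bare and quoted characters are converted with `ord` (which matches the byte value only for ASCII characters), and a group is accepted if it has either one value or all three.

```python
import re

TOKEN_RE = re.compile(r"""
    '(?P<quoted>.)'                  # quoted single character
  | \\d(?P<dec>[0-9]{1,3})           # backslash-d decimal value
  | \\x(?P<hex>[0-9A-Fa-f]{1,2})     # backslash-x hexadecimal value
  | (?P<sep>[:,])                    # colon or comma separator
  | (?P<bare>[^\s'\\:,])             # any other single character
""", re.VERBOSE)


def tokenize(line):
    """Return separators as strings and character values as integers."""
    tokens, pos = [], 0
    while pos < len(line):
        if line[pos].isspace():      # spaces between tokens are ignored
            pos += 1
            continue
        match = TOKEN_RE.match(line, pos)
        if match is None:
            raise ValueError("cannot parse %r at column %d" % (line, pos))
        if match.group("sep") is not None:
            tokens.append(match.group("sep"))
        elif match.group("quoted") is not None:
            tokens.append(ord(match.group("quoted")))
        elif match.group("dec") is not None:
            tokens.append(int(match.group("dec")))
        elif match.group("hex") is not None:
            tokens.append(int(match.group("hex"), 16))
        else:
            tokens.append(ord(match.group("bare")))
        pos = match.end()
    return tokens


def parse_sequence_line(line):
    """Return (sort_position or None, [(character, lowercase, uppercase), ...])."""
    tokens = tokenize(line)
    if ":" not in tokens:
        raise ValueError("missing ':' in %r" % line)
    colon = tokens.index(":")
    head, body = tokens[:colon], tokens[colon + 1:]
    if len(head) > 1:
        raise ValueError("more than one sort-position in %r" % line)
    sort_position = head[0] if head else None
    entries, group = [], []
    for token in body + [","]:       # trailing comma flushes the last group
        if token == ",":
            if len(group) == 1:
                entries.append((group[0], None, None))
            elif len(group) == 3:
                entries.append(tuple(group))
            else:
                raise ValueError("expected 1 or 3 values per character in %r" % line)
            group = []
        elif token == ":":
            raise ValueError("unexpected second ':' in %r" % line)
        else:
            group.append(token)
    return sort_position, entries


print(parse_sequence_line(r": e e E, \x82 \x82 \x90, \x8A \x8A \xD4"))
```

Quoted forms are matched before separators, so lines such as `: ','` and `: ':'` yield the comma and colon characters themselves rather than being read as structure.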

Other syntax notes

For databases using case-insensitive sorting and comparison (no -c specified on the dbinit command line), the lowercase and uppercase mappings are used to find the lowercase and uppercase characters that will be sorted together.

For multibyte character sets, only the first byte of a character is listed in the collation sequence. All characters with the same first byte are sorted together and ordered according to the value of the second byte. For example, the following is part of a Shift-JIS collation file:

:   \xfb
:   \xfc
:   \xfd

In this collation, all characters with first byte \xfc come after all characters with first byte \xfb and before all characters with first byte \xfd. The two-byte character \xfc \x01 would be ordered before the two-byte character \xfc \x02.

Any characters omitted from the collation will be added to the collation at the position equal to their binary value. dbinit issues a message for each omitted character. It is recommended that every collation list all 256 characters (first bytes).
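
Before building a database, it can help to check a custom collation for gaps. The helper below is a minimal sketch with a hypothetical name; it assumes you have already collected the byte values listed in the collation sequence, for example with your own parser.

```python
def omitted_first_bytes(listed):
    """Return the byte values (0-255) that do not appear in the collation."""
    present = set(listed)
    return [value for value in range(256) if value not in present]


# Hypothetical collation that lists only \x20 through \xff.
missing = omitted_first_bytes(range(0x20, 0x100))
print(["\\x%02x" % value for value in missing])
```

An empty result means the collation covers all 256 first-byte values, so no characters would be placed by their binary value.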

The Encodings section

The Encodings section is optional, and follows the collation sequence. It is not useful for single-byte character sets.

The Encodings section lists those combinations of bytes that are valid characters. The format of the section is easiest to describe by example.

The Shift-JIS Encodings section is as follows:

Encodings:
[\x00-\x80,\xa0-\xdf,\xf0-\xff]
[\x81-\x9f,\xe0-\xef][\x00-\xff]

The first line following the section title lists valid single-byte characters. The square brackets enclose a comma-separated list of ranges. Each range is listed as a hyphen-separated pair of values. In the Shift-JIS collation, values \x00 to \x80 are valid single-byte characters, but \x81 is not a valid single-byte character.

The second line following the section title lists valid multibyte characters. Each bracketed list on this line describes one byte position: any byte from the first list followed by any byte from the second list is a valid character. Therefore \x81 \x00 is a valid double-byte character, but \xd0 \x00 is not, because \xd0 is not a valid first byte.
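
The Encodings syntax can be modeled as one list of byte ranges per byte position. The sketch below parses the two Shift-JIS lines shown above and tests candidate characters against them. The function names are hypothetical, and the parser handles only the \xhh notation used in this example.

```python
import re

GROUP_RE = re.compile(r"\[([^\]]*)\]")
RANGE_RE = re.compile(r"\\x([0-9A-Fa-f]{2})(?:-\\x([0-9A-Fa-f]{2}))?")


def parse_encoding_line(line):
    """Return one list of (low, high) ranges per bracketed group on the line."""
    pattern = []
    for group in GROUP_RE.findall(line):
        ranges = []
        for low_hex, high_hex in RANGE_RE.findall(group):
            low = int(low_hex, 16)
            high = int(high_hex, 16) if high_hex else low
            ranges.append((low, high))
        pattern.append(ranges)
    return pattern


def is_valid_character(data, patterns):
    """True if data matches a pattern of the same length, byte by byte."""
    for pattern in patterns:
        if len(pattern) != len(data):
            continue
        if all(any(low <= byte <= high for low, high in ranges)
               for byte, ranges in zip(data, pattern)):
            return True
    return False


ENCODING_LINES = [
    r"[\x00-\x80,\xa0-\xdf,\xf0-\xff]",
    r"[\x81-\x9f,\xe0-\xef][\x00-\xff]",
]
PATTERNS = [parse_encoding_line(line) for line in ENCODING_LINES]

for candidate in (b"\x80", b"\x81", b"\x81\x00", b"\xd0\x00"):
    print(candidate, is_valid_character(candidate, PATTERNS))
```

The four candidates correspond to the cases discussed in the text: \x80 and \x81 \x00 are accepted, while \x81 on its own and \xd0 \x00 are rejected.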

The Properties section

The Properties section is optional, and follows the Encodings section. It is not useful for single-byte character sets.

If a Properties section is supplied, an Encodings section must be supplied also.

The Properties section lists the first-byte values of characters that represent alphabetic characters, digits, or spaces.

The Shift-JIS Properties section is as follows:

Properties:
space: [\x09-\x0d,\x20]
digit: [\x30-\x39]
alpha: [\x41-\x5a,\x61-\x7a,\x81-\x9f,\xe0-\xef]

This indicates that characters with first bytes \x09 to \x0d, as well as \x20, are treated as space characters; digits have first bytes in the range \x30 to \x39 inclusive; and alphabetic characters have first bytes in the four ranges \x41-\x5a, \x61-\x7a, \x81-\x9f, and \xe0-\xef.
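
As a rough illustration, the Shift-JIS properties above can be expressed as a lookup on the first byte of a character. The table is transcribed by hand from the example, and the `classify` function is a hypothetical helper rather than anything the database exposes.

```python
# Ranges transcribed from the Shift-JIS Properties section shown above.
PROPERTIES = {
    "space": [(0x09, 0x0D), (0x20, 0x20)],
    "digit": [(0x30, 0x39)],
    "alpha": [(0x41, 0x5A), (0x61, 0x7A), (0x81, 0x9F), (0xE0, 0xEF)],
}


def classify(char_bytes):
    """Return 'space', 'digit', 'alpha', or None based on the first byte only."""
    first = char_bytes[0]
    for name, ranges in PROPERTIES.items():
        if any(low <= first <= high for low, high in ranges):
            return name
    return None


for sample in (b" ", b"7", b"\x82\xa0", b"!"):
    print(sample, classify(sample))
```

Since only the first byte is examined, every double-byte character whose first byte lies in \x81-\x9f or \xe0-\xef is classified as alphabetic, regardless of its second byte.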