Mechanism of Unicode Collation Algorithm

By jeevan01



4. Mechanism of Unicode Collation Algorithm

The algorithm takes as input a Unicode string and a Collation Element Table and returns a sort key. This sort key can then be compared with any other sort key to give the correct comparison between the strings for which they were generated. The design of the Unicode Collation Algorithm allows implementations to produce culturally acceptable collation while putting the least burden on implementations in terms of memory requirements and performance. In particular, Collation Element Tables only require storage of 32 bits of collation data per significant character.



However, implementations of the Unicode Collation Algorithm are not limited to supporting only three levels. They are free to support a fully customizable 4th level (or more levels), as long as they can produce the same results as the basic algorithm, given the right Collation Element Tables. For example, an application which uses the algorithm, but which must treat some collection of special characters as ignorable at the first three levels and must have those specials collate in non-Unicode order (as, for example, to emulate an existing EBCDIC-based collation), may choose to have a fully customizable 4th level. The downside of this choice is that such an application will require more storage, both for the Collation Element Table and in constructed sort keys. The Collation Element Table may be tailored to produce particular culturally required orderings for different languages or locales. As in the algorithm itself, the tailoring can provide full customization for three (or more) levels.

Mainly, there are four steps in this process: normalize each input string, produce a collation element array for each, form a sort key from the collation elements, and compare the sort keys.

4.1. Normalize

Step 1. Produce a normalized form of each input string.

S1.1 Use the Unicode canonical algorithm to decompose characters according to the canonical mappings. That is, put the string into Normalization Form D.
Decomposition, the rearrangement of combining marks, and the removal of characters that are ignorable at all levels are also significant considerations at this step.
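
As a minimal illustration of S1.1 (not part of the algorithm text itself), Python's standard unicodedata module produces Normalization Form D directly:

    import unicodedata

    def normalize_input(s: str) -> str:
        # S1.1: put the string into Normalization Form D (canonical decomposition).
        return unicodedata.normalize("NFD", s)

    # "é" (U+00E9) decomposes into "e" (U+0065) followed by U+0301 COMBINING ACUTE ACCENT.
    assert normalize_input("\u00e9") == "e\u0301"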

4.2. Produce Collation elements array

Step 2. The collation element array is built by sequencing through the normalized form as follows (a simplified sketch follows the list):

S2.1 Find the longest initial substring S at each point that has a match in the table.
S2.1.1 If there are any non-starters following S, process each non-starter C.
S2.1.2 If C is not blocked from S, find if S + C has a match in the table.
S2.1.3 If there is a match, replace S by S + C, and remove C.
S2.2 Fetch the corresponding collation element(s) from the table if there is a match. If there is no match, synthesize a weight.
S2.3 Process collation elements according to the variable-weight setting.
S2.4 Append the collation element(s) to the collation element array.
S2.5 Proceed to the next point in the string (past S).
S2.6 Loop until the end of the string is reached.
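
The loop above can be sketched as follows. This is a simplified illustration only: the table and its weights are invented, the synthesized weight is a placeholder rather than the real derivation, and the non-starter handling of S2.1.1-S2.1.3 and the variable-weight processing of S2.3 are omitted.

    # Simplified sketch of Step 2 (S2.1, S2.2, S2.4-S2.6) over a tiny hypothetical table.
    TABLE = {
        "a":  [(0x1C47, 0x0020, 0x0002)],
        "b":  [(0x1C60, 0x0020, 0x0002)],
        "ch": [(0x1D00, 0x0020, 0x0002)],   # hypothetical contraction (e.g. Slovak "ch")
        "c":  [(0x1C7A, 0x0020, 0x0002)],
    }

    def collation_elements(normalized: str):
        elements = []
        i = 0
        while i < len(normalized):                      # S2.6: loop to the end of the string
            # S2.1: find the longest initial substring S with a match in the table.
            for length in range(len(normalized) - i, 0, -1):
                s = normalized[i:i + length]
                if s in TABLE:
                    elements.extend(TABLE[s])           # S2.2 + S2.4: fetch and append
                    i += length                         # S2.5: proceed past S
                    break
            else:
                # S2.2: no match; synthesize a placeholder weight from the code point
                # (the real derivation is described elsewhere in UTS #10).
                elements.append((0xFB40 + ord(normalized[i]), 0x0020, 0x0002))
                i += 1
        return elements

    print(collation_elements("cab"))   # "c", "a", "b" each match individually
    print(collation_elements("chab"))  # "ch" matches as a single contraction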

4.3. Form sort key

Step 3. The sort key is formed by successively appending weights from the collation element array. The weights are appended from each level in turn, from 1 to 3. (Backwards weights are inserted in reverse order.)

An implementation may allow the maximum level to be set to a smaller level than the available levels in the collation element array. For example, if the maximum level is set to 2, then level 3 and higher weights are not appended to the sort key. Thus any differences at levels 3 and higher will be ignored, leveling any such differences in string comparison.

Here is a more detailed statement of the algorithm (a sketch in code follows after S3.10):

S3.1 For each weight level L in the collation element array from 1 to the maximum level,
S3.2 If L is not 1, append a level separator*
S3.3 If the collation element table is forwards at level L,
S3.4 For each collation element CE in the array
S3.5 Append CE_L (the level-L weight of CE) to the sort key if CE_L is non-zero.
S3.6 Else the collation table is backwards at level L, so
S3.7 Form a list of all the non-zero CE_L values.
S3.8 Reverse that list.
S3.9 Append the CE_L values from that list to the sort key.

* The level separator is zero (0000), which is guaranteed to be lower than any weight in the resulting sort key. This guarantees that when two strings of unequal length are compared, where the shorter string is a prefix of the longer string, the longer string is always sorted after the shorter (in the absence of special features like contractions).

For example:

"abc" < "abcX" where "X" can be any character(s)

S3.10 If a semi-stable sort is required, then after all the level weights have been added, append a copy of the NFD version of the original string. This strength level is called "identical".
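
Here is a sketch of this step for a table that is forwards at every level (so S3.6-S3.9 never apply), using 0000 as the level separator. The collation elements and weights are illustrative only.

    # Sketch of Step 3 (S3.1-S3.5). Each collation element is an (L1, L2, L3) tuple.
    def sort_key(collation_element_array, max_level=3):
        key = []
        for level in range(1, max_level + 1):           # S3.1
            if level != 1:
                key.append(0x0000)                      # S3.2: level separator
            for ce in collation_element_array:          # S3.4
                weight = ce[level - 1]
                if weight != 0:
                    key.append(weight)                  # S3.5: append non-zero weights
        return tuple(key)

    # "a" vs "a" plus a combining mark: the second string gains an extra L2 weight.
    a       = [(0x1C47, 0x0020, 0x0002)]
    a_grave = [(0x1C47, 0x0020, 0x0002), (0x0000, 0x0035, 0x0002)]
    print(sort_key(a))        # L1 weight, separator, L2 weight, separator, L3 weight
    print(sort_key(a_grave))  # the extra element contributes one more L2 and one more L3 weight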

4.4. Compare sort keys

Step 4. Compare the sort keys for each of the input strings, using a binary comparison. This means that:

Level 3 differences are ignored if there are any Level 1 or 2 differences.
Level 2 differences are ignored if there are any Level 1 differences.
Level 1 differences are never ignored.
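
Continuing the illustrative keys from the Step 3 sketch, the comparison itself is nothing more than an ordinary binary (lexicographic) comparison:

    # Step 4: the first differing weight decides; later levels only matter when all
    # earlier levels are identical, because they appear later in the key.
    key_a       = (0x1C47, 0x0000, 0x0020, 0x0000, 0x0002)
    key_a_grave = (0x1C47, 0x0000, 0x0020, 0x0035, 0x0000, 0x0002, 0x0002)

    assert key_a < key_a_grave        # the L2 difference (0x0000 vs 0x0035) decides

    # Packed as big-endian bytes, a byte-wise comparison gives the same answer.
    def pack(key):
        return b"".join(w.to_bytes(2, "big") for w in key)
    assert pack(key_a) < pack(key_a_grave)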

5. Implementation issues

For efficiency, implementations may vary from this algorithm as long as they produce the same results. The following issues can be identified in implementing the UCA:

1. reducing sort key lengths,
2. reducing table sizes, and
3. customizing for additional environments.

5.1. Reducing Sort Key Lengths

There are methods of reducing sort key lengths. If these methods are applied to all of the sort keys produced by an implementation, they can result in significantly shorter and more efficient sort keys while retaining the same ordering.

5.1.1 Eliminating Level Separators

Level separators are not needed between two levels in the sort key, if the weights are properly chosen.

For example, if all L3 weights are less than all L2 weights, then no level separator is needed between them. If there is a fourth level, then the separator before it needs to be retained.
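
A small illustration with invented weights: assuming every L3 weight (0x02 here) is less than every L2 weight (0x20 and above here), dropping the separator between L2 and L3 leaves the ordering unchanged.

    # Keys built with and without the L2/L3 separator order the same way.
    def key(ces, separators):
        out = []
        for level in range(3):
            if level in separators:
                out.append(0x0000)
            out.extend(ce[level] for ce in ces if ce[level] != 0)
        return tuple(out)

    a       = [(0x1C47, 0x20, 0x02)]
    a_grave = [(0x1C47, 0x20, 0x02), (0x0000, 0x35, 0x02)]

    with_sep    = (key(a, {1, 2}), key(a_grave, {1, 2}))   # separators before L2 and L3
    without_sep = (key(a, {1}),    key(a_grave, {1}))      # L2/L3 separator dropped

    assert (with_sep[0] < with_sep[1]) == (without_sep[0] < without_sep[1]) == True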

5.1.2 L2/L3 in 8 Bits

The L2 and L3 weights are commonly small values. Where that condition holds for all possible values, they can be represented as single 8-bit quantities. When this technique is combined with the elimination of level separators (grouping the key by bytes), note that the separator after the primary weights has to remain. If any separators are retained (such as before the fourth level), they need to have the same width as the weights of the previous level.

5.1.3 Machine Words

Depending on the machine architecture, the sort key can be represented as an array of different quantities. For example, comparisons as arrays of 32-bit quantities may be much faster on some machines. If this is done, the original sort key is to be padded with trailing (not leading) zeros as necessary.
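
For example, a byte-oriented sort key can be padded to a multiple of 4 bytes so it can be compared as 32-bit units (the key bytes below are illustrative):

    # Pad with trailing (not leading) zeros to a multiple of 4 bytes; trailing zeros
    # do not change the binary order because 00 is lower than any weight byte.
    def pad_to_32_bits(key: bytes) -> bytes:
        remainder = len(key) % 4
        return key if remainder == 0 else key + b"\x00" * (4 - remainder)

    key = bytes([0x1C, 0x47, 0x00, 0x00, 0x00, 0x20])
    padded = pad_to_32_bits(key)            # 8 bytes: two 32-bit units
    assert len(padded) % 4 == 0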

5.2 Large Weight Values

If a collation sequence requires more than 65,535 weight values (or 65,024 values where zero bytes are avoided), this can still be accommodated by using multiple collation elements for a single character. For example, suppose that 50,000 UTF-16 supplementary characters are assigned in a particular implementation, and that these are to be sorted after X, whose primary weight is X1. Simply assign them all dual collation elements of the form [(X1+1).0000.0000] [yyyy.zzzz.wwww]. If there was an element with the primary weight (X1+1), then it also needs to be converted into a dual collation element. The characters will then sort properly with respect to each other and to the rest of the characters. The first collation element is one of the instances where ill-formed collation elements are allowed. Because the second collation element is well-formed and the first element will only occur in combination, ordering is preserved.
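
A sketch of the dual-element assignment, with X1 standing for the (hypothetical) primary weight of X and all other weights invented:

    X1 = 0x2100

    def dual_collation_elements(index):
        # First element: deliberately ill-formed, it only marks "sorts after X".
        # Second element: well-formed, it carries the per-character weights.
        yyyy = 0x0001 + index                  # hypothetical per-character primary weight
        return [(X1 + 1, 0x0000, 0x0000), (yyyy, 0x0020, 0x0002)]

    # Two of the 50,000 characters: both sort after X, and in index order relative to each other.
    print(dual_collation_elements(0))
    print(dual_collation_elements(1))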

5.3 Reducing Table Sizes

The data tables required for full Unicode sorting can be quite sizable. This section discusses ways to significantly reduce the table size in memory. These have very important implications for implementations.

5.4 Avoiding Zero Bytes

If the resulting sort key is to be a C-string, then zero bytes must be avoided. This can be done by:

1. using the value 0101 (hex) for the level separator instead of 0000, or
2. preprocessing the weight values to avoid zero bytes, such as remapping as follows:

x => 0101 (hex) + (x / 255)*256 + (x % 255)

Where the values are limited to 8-bit quantities (as discussed above), zero bytes are even more easily avoided by just using 01 as the level separator (where one is necessary), and mapping weights by x => 01 + x.
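
The remapping can be written out as below; for weights in the range discussed above, neither byte of the result is zero:

    def remap_weight(x: int) -> int:
        # High byte becomes 01 + x // 255, low byte becomes 01 + x % 255.
        return 0x0101 + (x // 255) * 256 + (x % 255)

    for x in (0x0000, 0x00FF, 0x1C47, 0xFD02):
        remapped = remap_weight(x)
        assert (remapped >> 8) != 0 and (remapped & 0xFF) != 0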

5.5 Avoiding Normalization

Implementations that do not handle separate combining marks can map decomposable characters (such as "à") to single collation elements with different Level 2 weights for the different accents. For more information, see Section 7, Weight Derivation. However, this does require including the mappings for these characters in the collation table, which will increase the size substantially unless the collation elements for the Hangul Syllables are computed algorithmically.
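
For instance, a table fragment of this kind might look as follows (the weights are hypothetical); the precomposed characters map straight to single collation elements, so they never need to be decomposed:

    # Hypothetical table fragment: precomposed characters map directly to single
    # collation elements whose L2 weights encode the accent difference.
    PRECOMPOSED = {
        "a":      (0x1C47, 0x0020, 0x0002),
        "\u00E0": (0x1C47, 0x0035, 0x0002),   # "à": same L1 as "a", its own L2 weight
        "\u00E1": (0x1C47, 0x0036, 0x0002),   # "á": same L1 as "a", another L2 weight
    }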

5.6 Case Comparisons

In some languages, it is common to sort lowercase before uppercase; in other languages this is reversed. Often this is more dependent on the individual concerned, and is not standard across a single language. It is strongly recommended that implementations provide a parameterization that allows uppercase to be sorted before lowercase, and provide information as to the standard (if any) for particular countries. This can easily be done to the Default Unicode Collation Element Table before tailoring by remapping the L3 weights (see Section 7, Weight Derivation). It can be done after tailoring by finding the case pairs and swapping the collation elements.
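
One possible shape for such a parameter, sketched against a hypothetical single-character table: swap the collation elements of each case pair before keys are built, so that uppercase takes the lower tertiary weight.

    def uppercase_first(table):
        # Swap the collation elements of each lowercase/uppercase pair found in the table.
        adjusted = dict(table)
        for ch in list(table):
            upper = ch.upper()
            if upper != ch and upper in table:
                adjusted[ch], adjusted[upper] = table[upper], table[ch]
        return adjusted

    TABLE = {"a": (0x1C47, 0x20, 0x02), "A": (0x1C47, 0x20, 0x08)}
    print(uppercase_first(TABLE))   # "A" now carries the lower L3 weight 0x02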

5.7 Incremental Comparison

Implementations do not actually have to produce full sort keys. Collation elements can be incrementally generated as needed from two strings, and compared with an algorithm that produces the same results as sort keys would have. The choice of which algorithm to use depends on the number of comparisons between the same strings.

Generally incremental comparison is more efficient than producing full sort keys if strings are only to be compared once and if they are generally dissimilar, because differences are caught in the first few characters without having to process the entire string.

Generally incremental comparison is less efficient than producing full sort keys if items are to be compared multiple times.

However, it is very tricky to produce an incremental comparison that produces correct results. For example, some implementations have not even been transitive! Be sure to test any code for incremental comparison thoroughly.
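
As one possible shape of such an algorithm (assuming tables that are forwards at every level and no variable weighting), the comparison can walk the non-zero weights of both collation element arrays level by level and stop at the first difference; treating a missing weight as 0 mirrors the level separator.

    from itertools import zip_longest

    def incremental_compare(ces_a, ces_b, max_level=3):
        # Equivalent in result to comparing full sort keys under the assumptions above.
        for level in range(max_level):
            weights_a = (ce[level] for ce in ces_a if ce[level] != 0)
            weights_b = (ce[level] for ce in ces_b if ce[level] != 0)
            for wa, wb in zip_longest(weights_a, weights_b, fillvalue=0):
                if wa != wb:
                    return -1 if wa < wb else 1   # first difference at this level decides
        return 0

    a       = [(0x1C47, 0x20, 0x02)]
    a_grave = [(0x1C47, 0x20, 0x02), (0x0000, 0x35, 0x02)]
    assert incremental_compare(a, a_grave) == -1
    assert incremental_compare(a, a) == 0

In a real implementation, the collation element arrays themselves would also be generated lazily from the input strings rather than built up front.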

5.8 Catching Mismatches

Sort keys from two different tailored collations cannot be compared, because the weights may end up being rearranged arbitrarily. To catch this case, implementations can produce a hash value from the collation data, and prepend it to the sort key. Except in extremely rare circumstances, this will distinguish the sort keys. The implementation then has the opportunity to signal an error.
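
A sketch of that safeguard, assuming the tailored collation data can be serialized to bytes in some implementation-defined way:

    import hashlib

    def tag_sort_key(collation_data: bytes, sort_key: bytes) -> bytes:
        # Prepend a 4-byte fingerprint of the collation data to the sort key.
        return hashlib.sha256(collation_data).digest()[:4] + sort_key

    def compare_tagged(key_a: bytes, key_b: bytes) -> int:
        if key_a[:4] != key_b[:4]:
            # The keys were (almost certainly) built from different tailorings.
            raise ValueError("sort keys come from different collation tailorings")
        return (key_a > key_b) - (key_a < key_b)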

5.9 Collation implementation

Collation is not uniform; it varies according to language and culture: Germans, French and Swedes sort the same characters differently. It may also vary by specific application: even within the same language, dictionaries may sort differently than phonebooks or book indices. For non-alphabetic scripts such as East Asian ideographs, collation can be either phonetic or based on the appearance of the character. Collation can also be commonly customized or configured according to user preference, such as ignoring punctuation or not, putting uppercase before lowercase (or vice versa), and so on. Linguistically correct searching also needs to use the same mechanisms: just as "v" and "w" sort as if they were the same base letter in Swedish, a loose search should pick up words with either one of them.

Thus collation implementations must deal with the often-complex linguistic conventions that communities of people have developed over the centuries for ordering text in their language, and provide for common customizations based on user preferences. And while doing all of this, of course, performance is critical.

The conventions that people have developed over the centuries for collating text in their language are often quite complex. Sorting all Unicode characters in a uniform and consistent manner presents a number of challenges. And for any collation mechanisms to be accepted in the marketplace, algorithms that allow for good performance are crucial.

Languages vary not only regarding which types of sorts to use (and in which order they are to be applied), but also in what constitutes a fundamental element for sorting. For example, Swedish treats ä as an individual letter, sorting it after z in the alphabet; German, however, sorts it either like ae or like other accented forms of a, thus following a. In Slovak, the digraph ch sorts as if it were a separate letter after h. Examples from other languages (and scripts) abound. Languages whose writing systems use uppercase and lowercase typically ignore the differences in case, unless there are no other differences in the text.

References: Unicode Collation Algorithm, http://www.unicode.org/unicode/reports/tr10/



