Unicode Collation Algorithm

Article By jeevan01



1. Introduction

Put simply, the Unicode Collation Algorithm (UCA) is an algorithm for comparing two Unicode strings. There are many other mechanisms for doing this comparison; what sets the UCA apart from them is its specific goals.

2. Collation

One of the most frequently used data types is the string, and one of the most common operations on a set of strings is arranging them in their natural order: databases, search engines, and numerous other applications rely on a sorted order. Collation addresses this need. It defines the relative order of any two strings of characters, and that relation is the basis for sorting a whole list, as the following example shows. Some properties of string collation are listed after the example.

If cat < cot and Cat < cat, then this provides the basis for sorting the list cot, Cat, cat as Cat, cat, cot.
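The example can be reproduced with a minimal Python sketch. The two-level key `toy_key` is a hypothetical helper, not part of the UCA: it compares letters first with case ignored, and lets case break ties only afterwards, with uppercase placed first to match Cat < cat.

```python
def toy_key(word):
    # Level 1: the letters with case folded away.
    # Level 2: case pattern; False (uppercase) sorts before True (lowercase).
    return (word.casefold(), tuple(ch.islower() for ch in word))


words = ["cot", "Cat", "cat"]
ordered = sorted(words, key=toy_key)
print(ordered)  # ['Cat', 'cat', 'cot']
```

The case level is consulted only when the case-folded letters are identical, which is why cot still sorts last even though it is all lowercase.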


1. Collation is not a binary order: the collation order of two characters is not necessarily the same as the numeric order of their Unicode code points.

2. Collation is not aligned with the character set. Two languages that use the same character set may have different collation sequences (for example, German ö).

3. Collation is not a property of strings; rather, it is a property of a language.

Collation is the general term for the process and function of determining the sorting order of strings of characters. It is a key function in computer systems: whenever a list of strings is presented to users, they are likely to want it in a sorted order so that they can easily and reliably find individual strings. Thus it is widely used in user interfaces. It is also crucial for the operation of databases, not only in sorting records but also in selecting sets of records with fields within given bounds.
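The first property listed above, that collation is not a binary order, can be seen in Python, whose default string comparison is by code point. The snippet below only illustrates binary order; it does not perform any collation, and the word list is made up for the purpose.

```python
words = ["zebra", "Zebra", "apple", "\u00f6l"]  # "\u00f6" is the letter ö

# sorted() on str values compares code points, i.e. it is a binary order.
binary_order = sorted(words)

for word in binary_order:
    # Show the code point of the first character next to each word.
    print(f"U+{ord(word[0]):04X}", word)
```

In this order every uppercase ASCII letter precedes every lowercase one, and ö (U+00F6) follows z (U+007A), whichever language the words belong to. A reader looking for "apple" before "Zebra" would not find the list where they expect it.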

3. What is the Unicode Collation Algorithm (UCA)?

The Unicode Collation Algorithm (UCA) provides a specification for how to compare two Unicode strings while remaining conformant to the requirements of The Unicode Standard. The UCA also supplies the Default Unicode Collation Element Table (DUCET), which is data specifying the default collation order for all Unicode characters. This table is designed so that it can be tailored to meet the requirements of different languages and other customizations.
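To give a feel for a table-driven comparison, here is a rough Python sketch in which each character maps to a triple of weights (primary, secondary, tertiary) and a sort key is formed level by level. `TOY_TABLE`, `LEVEL_SEPARATOR` and `sort_key` are hypothetical names, and every weight is invented; none of them come from DUCET. The sketch also leaves out much of what the real algorithm does, such as normalization and multi-character mappings.

```python
# Hypothetical miniature weight table: character -> (primary, secondary, tertiary).
# All weights are invented for illustration and are NOT taken from DUCET.
# Uppercase gets the smaller tertiary weight only to match the Cat < cat example.
TOY_TABLE = {
    "a": (1, 1, 2), "A": (1, 1, 1),
    "c": (2, 1, 2), "C": (2, 1, 1),
    "o": (3, 1, 2), "O": (3, 1, 1),
    "\u00f6": (3, 2, 2),  # ö: same primary weight as o, different secondary
    "t": (4, 1, 2), "T": (4, 1, 1),
}

LEVEL_SEPARATOR = 0  # lower than every weight in the table


def sort_key(text, table=TOY_TABLE):
    """Concatenate all primary weights, then all secondary, then all tertiary."""
    elements = [table[ch] for ch in text]
    key = []
    for level in range(3):
        if level > 0:
            key.append(LEVEL_SEPARATOR)
        key.extend(element[level] for element in elements)
    return tuple(key)


words = ["cot", "Cat", "c\u00f6t", "cat"]
print(sorted(words, key=sort_key))  # ['Cat', 'cat', 'cot', 'cöt']
```

Because all primary weights come first in the key, a difference in base letters always outweighs a difference in accents or case. Tailoring for a language would then amount to changing entries in the table rather than changing the comparison code.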

4. Collation implementation

Collation is not uniform; it varies according to language and culture: Germans, French and Swedes sort the same characters differently. It may also vary by specific application: even within the same language, dictionaries may sort differently from phonebooks or book indices. For non-alphabetic scripts such as East Asian ideographs, collation can be either phonetic or based on the appearance of the character. Collation is also commonly customized or configured according to user preference, such as ignoring punctuation or not, putting uppercase before lowercase (or vice versa), and so on. Linguistically correct searching needs to use the same mechanisms: just as "v" and "w" sort as if they were the same base letter in Swedish, a loose search should pick up words with either one of them.
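The user-facing options mentioned above can be sketched in Python as parameters of a key-building function. Everything here is hypothetical and much simpler than a real collation library: `make_key`, its two options, and `loose_find` are illustrative names, and the v/w folding simply follows the Swedish example in the text.

```python
import string


def make_key(ignore_punctuation=False, upper_first=True):
    """Return a sort key function configured by two hypothetical options."""
    def key(text):
        if ignore_punctuation:
            text = "".join(ch for ch in text if ch not in string.punctuation)
        # Case only breaks ties; smaller values sort first.
        case_level = tuple(ch.islower() == upper_first for ch in text)
        return (text.casefold(), case_level)
    return key


words = ["co-op", "coop", "Coop", "cook"]
print(sorted(words, key=make_key()))
# ['co-op', 'cook', 'Coop', 'coop']
print(sorted(words, key=make_key(ignore_punctuation=True, upper_first=False)))
# ['cook', 'co-op', 'coop', 'Coop']

# Loose search: fold "w" to "v" so both count as the same base letter.
W_TO_V = str.maketrans({"w": "v"})


def base_form(text):
    return text.casefold().translate(W_TO_V)


def loose_find(query, candidates):
    return [c for c in candidates if base_form(query) in base_form(c)]


print(loose_find("vallin", ["Wallin", "Vallin", "Berg"]))  # ['Wallin', 'Vallin']
```

The search reuses the same folding idea as the sort key, which is the point the paragraph makes: sorting and loose matching should rest on one shared notion of which characters count as the same.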

Thus collation implementations must deal with the often complex linguistic conventions that communities of people have developed over the centuries for ordering text in their language, and must provide for common customization based on user preferences. Sorting all Unicode characters in a uniform and consistent manner presents a number of challenges of its own. Throughout, performance is critical: for any collation mechanism to be accepted in the marketplace, it needs algorithms that allow for good performance.

Languages vary not only regarding which types of sorts to use (and in which order they are to be applied), but also in what constitutes a fundamental element for sorting. For example, Swedish treats ä as an individual letter, sorting it after z in the alphabet; German, however, sorts it either like ae or like other accented forms of a, thus following a. In Slovak, the digraph ch sorts as if it were a separate letter after h. Examples from other languages (and scripts) abound. Languages whose writing systems use uppercase and lowercase typically ignore the differences in case, unless there are no other differences in the text.

References: Unicode Collation Algorithm, http://www.unicode.org/unicode/reports/tr10/
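The language-specific conventions described in this section (Swedish ä after z, German ä alongside a, Slovak ch as a letter after h) can be sketched in Python as different ordered lists of sorting units. This is a simplified illustration under stated assumptions: `make_alphabet_key` and its `same_as` parameter are hypothetical, the alphabets contain only the letters needed for the examples, only one level of comparison is modeled, and the German variant shown is the one that treats ä like a (not the ae variant).

```python
import string

LATIN = list(string.ascii_lowercase)


def make_alphabet_key(alphabet, same_as=None):
    """Build a primary-level sort key from an ordered list of sorting units.

    alphabet: units in sorting order; a unit may be a digraph such as "ch".
    same_as:  optional {unit: base} pairs that sort exactly like `base`.
    """
    rank = {unit: i for i, unit in enumerate(alphabet)}
    for unit, base in (same_as or {}).items():
        rank[unit] = rank[base]
    longest = max(len(unit) for unit in rank)

    def key(word):
        text = word.casefold()
        weights = []
        i = 0
        while i < len(text):
            # Prefer the longest unit that matches at this position.
            for size in range(longest, 0, -1):
                unit = text[i:i + size]
                if unit in rank:
                    weights.append(rank[unit])
                    i += len(unit)
                    break
            else:
                raise ValueError(f"no weight for {text[i]!r}")
        return tuple(weights)

    return key


# Simplified alphabets: only the letters needed for the examples are added.
swedish = make_alphabet_key(LATIN + ["\u00e5", "\u00e4", "\u00f6"])  # å ä ö after z
german = make_alphabet_key(LATIN, same_as={"\u00e4": "a"})           # ä sorts like a
slovak = make_alphabet_key(LATIN[:8] + ["ch"] + LATIN[8:])           # ch after h

words = ["zon", "\u00e4pple", "apa"]
print(sorted(words, key=swedish))  # ['apa', 'zon', 'äpple']
print(sorted(words, key=german))   # ['apa', 'äpple', 'zon']
print(sorted(["chata", "cena", "hora", "ihla"], key=slovak))
# ['cena', 'hora', 'chata', 'ihla']
```

The same three strings come out in a different order under the Swedish and German settings, and the Slovak key has to read two characters at once to recognize ch, which is what it means for the fundamental element of sorting to differ between languages.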



