Puppy

Reputation: 146920

Unicode string handling using Windows API

I always assumed that Unicode string handling was some dark art. However, I've seen that the Windows API has functions for comparing Unicode strings, for example. Does that mean it's actually feasible to write a Unicode string class that can perform simple actions like sorting, equality comparison, and extraction from a file? Or are there hidden gotchas in the use of these functions that make it a really bad idea? I'm looking at libraries like ICU, and they seem incredibly over-complicated compared to what a Unicode string class backed by the Windows API could look like, which would resemble the standard string classes quite closely.

Upvotes: 2

Views: 2235

Answers (4)

Mihai Nita

Reputation: 5787

As others pointed out, it is not very difficult, and definitely not a dark art. But one comment: sorting and equality comparison don't have as much to do with Unicode as with local conventions, because these are locale-sensitive operations. For instance, German sorts things differently than Swedish, and differently than French.

On Windows you can just use CompareString (or CompareStringEx if you want to use locale names instead of numeric locale identifiers). It does the same thing as ICU's Collator (C++) or ucol_strcoll (C). At times you will get slightly different results between Windows and ICU, because Windows did everything independently (sometimes it is worse than ICU, sometimes better).
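To see why the locale-aware call matters, here is a minimal Python sketch (a toy stand-in, not what CompareString or ICU actually do; the fold table below is a drastically simplified illustration of a German collation table):

```python
# Toy illustration only: real locale-aware comparison should go through
# CompareString/CompareStringEx on Windows or ICU's Collator.
GERMAN_EQUIV = {"ä": "a", "ö": "o", "ü": "u", "ß": "ss"}

def german_sort_key(s):
    # Fold umlauts onto their base vowels so they sort next to them.
    return "".join(GERMAN_EQUIV.get(ch, ch) for ch in s.lower())

words = ["Zebra", "Äpfel", "Apfel"]

print(sorted(words))                       # raw code points: 'Äpfel' lands last
print(sorted(words, key=german_sort_key))  # 'Äpfel' sorts next to 'Apfel'
```

A raw code-point sort puts 'Ä' (U+00C4) after 'Z', which no German dictionary would do; a real collation table encodes far more rules than this one-line fold.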

But overall it will be OK (way better than a non-locale-aware comparison).

Upvotes: 0

Thomas

Reputation: 1160

Thanatos: Equality Comparison:... I only agree in part with this bullet point. I disagree in the sense that this is not specific to Unicode. This kind of complexity has its root cause in the locale you are using. Any character encoding needs to support this kind of language-specific feature if it claims to support the corresponding locale. And of course, providing support for something like this in a string class library (or font set) is tedious to implement.

Moreover, this kind of support is possible only to some extent. Consider the German umlaut 'ü'. A possible replacement for this letter in the German locale is the letter combination 'ue'. The words 'bügeln' (German for ironing) and 'buegeln' would appear in the same place in a dictionary. Go ahead and try this in the German-English dictionary at www.leo.org. Everybody who knows German and the meaning of the word 'bügeln' will recognise that 'buegeln' means the same thing.

This does not mean that ü = ue in German. The name 'Ruegger', as an example, is pronounced Ru-egger (no ü; there is a glottal stop before the e), and if a word like 'Rügger' existed then 'Ruegger' would appear before 'Rügger' in a dictionary (because u and ü are usually considered equivalent as far as lexicographical sorting is concerned, and e comes before g). You need to know the words to be able to tell this difference. This kind of language-specific complexity is not due to the fact that anyone might be using Unicode to encode the characters used to write down that language. Whatever encoding and whatever string class you are using, the application developer needs to know the specifics of the language, and how and to what extent the string class supports them.
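The two conventions described above can be sketched in a few lines of Python (a toy illustration under simplified assumptions; a real dictionary applies many more rules than these single replacements):

```python
# Toy sketch of the German dictionary conventions described above.

def lookup_key(word):
    # For lookup, 'ü' may be written as 'ue': bügeln and buegeln coincide.
    return word.lower().replace("ü", "ue")

def sort_key(word):
    # For sorting, u and ü are usually treated as equivalent.
    return word.lower().replace("ü", "u")

assert lookup_key("bügeln") == lookup_key("buegeln")

# 'Ruegger' sorts before the hypothetical 'Rügger' because e comes before g.
assert sorted(["Rügger", "Ruegger"], key=sort_key) == ["Ruegger", "Rügger"]
```

Note that the two keys are different functions: folding ü to 'ue' for lookup and folding it to plain 'u' for ordering are separate conventions, which is exactly why ü = ue does not hold in general.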

Most people in the English-speaking world never become aware of this complexity because their language is not very complex in this regard, and they got used to the complexity they encounter in their everyday work. (Just tell me why, in ASCII, all capital letters come before the lower-case ones. Why is it not A, a, B, b, C, c? It's just a convention everybody got used to. It's a pain if you need to write a dictionary where a and A should appear in the same place.) When it comes to Unicode, this complexity suddenly seems to become relevant because you are confronted with a concept which claims to support almost any locale in the world.

What is relevant, however, is that if you are switching from some other encoding to Unicode, you need to take into account that things like sorting and equality checks might be treated differently in a Unicode-enabled string library. In particular, you have every reason to get nervous if someone starts to talk about Unicode migration for a software project with lots of string manipulation in it. Such a migration implies tons of homework, and one reason is exactly this difference in sorting and equality checking of strings. Another reason is that Unicode character encodings need more space than the classical ANSI character encodings, making char** migrations a real headache.

Upvotes: 0

Thanatos

Reputation: 44256

Does that mean that it's actually feasible to write a Unicode string class that can perform simple actions like sorting, equality comparison, and extraction from a file?

Yes. C#, Java, .NET, Python (the list goes on) have Unicode strings as basic types, and even C/C++ has this through libraries like ICU.

Or are there hidden gotchas in the use of these functions that makes it actually a really bad idea?

Yes, there are gotchas. Less yes on the "bad idea". Let's take the examples you posted: "sorting, equality comparison, and extraction from a file".

  • Extraction from a file: This task is quite easy, if you know what character encoding your file is in. Most languages provide some means of reading a file in and translating from bytes to the "Unicode" type of that language. (For example, in Python, data = file_handle.read() reads from a file, then data.decode(encoding_my_file_uses) gets me a Unicode string object back, or a str in Python 3.)

  • Equality Comparison: Things get a little hairier here. The basic building block of Unicode is "code points". A Unicode string is nothing more than a sequence of code points. However, Unicode includes code points for accents that combine with the previous character, but it also has some code points with the accent "precomposed". é might be 2 code points (e + acute) or 1 code point. If I have two strings, one with the 2 code point version and one with the 1 code point version… are they the same? The answer may depend on what you want. Likewise, if you have a character with multiple accents (common in Vietnamese), the accents could be in any order.

    The key? You have to be aware of what kind of equality you want. Case insensitive equality operations make this even more fun, as different languages have different ideas about what the upper or lower case version of a letter is. That said, Unicode defines and provides methods for getting code points in certain orders (a way to normalize strings) to make these things easier. Libraries like ICU, and even some languages' standard libraries, have these already implemented for you in various functions.

  • Sorting: Sorting is much like equality, really. You need to be aware of what you actually want. Sort order can be language dependent. To me, ä and a are both "a's" and should be sorted together, but this isn't always true. (Some languages put ä after z.) Another example: where does 丵 sort to? As an English speaker, I don't have a good answer, other than "either before or after everything else". The simplest sort is just by code point order, but it doesn't yield anything useful to most humans.

    The answer here is similar: Unicode defines methods for how to do it, and various libraries (like ICU) implement these methods.
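The three bullets above can be condensed into a short Python sketch using only the standard library (unicodedata handles the normalization step):

```python
import unicodedata

# Extraction from a file: bytes plus a known encoding give you a string.
raw = "café\n".encode("utf-8")        # stand-in for file_handle.read()
text = raw.decode("utf-8")            # know your encoding!

# Equality: precomposed vs. combining forms of é.
one_cp = "caf\u00e9"                  # é as a single code point
two_cp = "cafe\u0301"                 # e + COMBINING ACUTE ACCENT
assert one_cp != two_cp               # raw code points differ...
assert unicodedata.normalize("NFC", one_cp) == \
       unicodedata.normalize("NFC", two_cp)  # ...but normalized, they match

# Sorting by raw code point: simple, but rarely what a human wants.
print(sorted(["b", "a", "ä"]))        # ['a', 'b', 'ä'] ('ä' lands after 'b')
```

Normalizing both sides (NFC or NFD, consistently) before comparing is the standard-library version of the "kind of equality" decision described above; proper locale-aware sorting still needs something like ICU's Collator.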

ICU, for example, should have the ability to do all of this for you in a relatively easy fashion. .NET includes methods for this as well. While the above might seem complex, I've found that most code I've ever written does not do manipulations that require most of the above. Most of the time, you're just putting strings together to make some output message to the user: a good formatting routine is all you need. (Like Python's unicode.format, or .NET's String.Format: anything that allows positional notation such as "The {0} was in the {1}".) Rarely, you need to sort information for the user: that's simply "figure out the appropriate locale for this user, sort this array using that locale, output."
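Positional notation matters because translations often reorder the arguments; a small Python sketch (the "translated" template here is made up for illustration):

```python
# Positional placeholders let each translation put arguments where its
# grammar needs them, without changing the calling code.
args = ("cat", "garden")
english   = "The {0} was in the {1}.".format(*args)
reordered = "In the {1} sat the {0}.".format(*args)  # hypothetical translation
print(english)     # The cat was in the garden.
print(reordered)   # In the garden sat the cat.
```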

If you've never used Unicode before, the first big step then is to just use it. Depending on your language, you may already be, but just unaware of it. Google for tutorials, read the Wikipedia articles. The bigger key, IMHO, is that if you're handling text data, you must be aware of what encoding it is in. Today, that answer, if it is known, is almost always "UTF-8" for serialized bytes, or for in memory stuff, "UTF-16" or "UTF-8".

Upvotes: 3

Thomas

Reputation: 1160

Unicode is the way to go in the future; look at http://msdn.microsoft.com/en-us/library/windows/desktop/dd374089%28v=vs.85%29.aspx for example, where they already say 'some newer functions support only Unicode versions'. Note the word 'newer'. .NET string classes are Unicode, as an example, as are Java string classes.

Using Unicode is not a dark art; actually it makes working with different languages incredibly easy. In a spare-time project of mine I'm using JSPs to accept user input for a dictionary in two languages (of the user's choice), then process the entries (sorting, extracting substrings, searching, concatenating) in Java, and finally write them to a DB using JDBC. Afterwards I can search for and retrieve them from the DB, process them and display them on a web page. I had to configure my development environment to support UTF-8 and use UTF-8 consistently, but from the moment I did so, this works for every language/keyboard layout the OS supports without me even bothering any more. Including Japanese, Arabic, Devanagari, Russian. A simple mouse click changes the keyboard layout and the program works all the same. This works on Linux, Windows XP and Windows 7, be it 32-bit or 64-bit. The DB I'm using supports this in all these environments, and the dev environment (Eclipse/Java) does too. I simply don't have to care anymore. Of course, if you sort Arabic strings you should know something about the Arabic language, about your sort algorithm, and about string comparison in the string classes you are using. But this is usually documented.

Configuring the development environment means, of course, that you know the places where it is relevant. These comprise, but are not limited to, the string classes you will be using, the encoding used by your editor, the encoding the templates (for XML, HTML, resource files, whatever) are using, the database tables, and so on. But once you have set it up using consistently one and only one character encoding, this is a very powerful and extremely easy-to-use setup.

You don't even have to bother about Unicode details. If you do, you will find, for example, that one can find out in which range the characters of a certain script are located, and you can extract all Arabic text from a Unicode string by just extracting that character range. Really cute.
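The range trick is easy to sketch in Python (shown for the main Arabic block, U+0600 to U+06FF; a complete solution would also cover the Arabic Supplement and presentation-form blocks):

```python
# Keep only the characters that fall in the main Arabic Unicode block.
def arabic_chars(s):
    return "".join(ch for ch in s if "\u0600" <= ch <= "\u06ff")

mixed = "price: سعر 42"
print(arabic_chars(mixed))  # سعر
```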

The point is to consistently use one encoding throughout the solution. If there are different encodings in use and you are not aware of it, this is likely to become the root cause of a serious headache. If you are consciously using different character encodings simultaneously and it works out correctly, then this may be, in fact, close to some dark art :-) You will need that art if you have to link against libraries which don't support your encoding, or which are not using it consistently.

(Of course, even if you use one particular encoding, you have to make yourself familiar especially with the string classes you are using. So if you do not need support for more than one language, the easiest way is to just use the default setup of your development environment for your locale.)

Upvotes: 0
