csharpdefector
csharpdefector

Reputation: 833

How do I compare unicode strings containing non-english characters to sort alpabetically?

I am trying to sort array/lists/whatever of data based upon the unicode string values in them which contain non-english characters, I want them sorted correctly alphabetically.

I have written a lot of code (D2010, win XP), which I thought was pretty solid for future internationalisation, but it is not. Its all using unicodestring (string) data type, which up until now I have just been putting english characters into the unicode strings.

It seems I have to own up to making a very serious unicode mistake. I talked to my German friend, and tried out some German ß's, (ß is 'ss' and should come after S and before T in alphabet) and and ö's etc (note the umlaut) and none of my sorting algorithms work anymore. Results are very mixed up. Garbage.

Since then I have been reading up extensively and learnt a lot of unpleasant things with regards to unicode collation. Things are looking grim, much grimmer than I ever expected, I have seriously messed this up. I hope I am missing something and things are not actually quite as grim as they appear at present. I have been tinkering around looking at windows api calls (RtlCompareUnicodeString) with no success (protection faults), I could not get it to work. Problem with API calls I learnt is that they change on various newer windows platforms, and also with delphi going cross plat soon, with linux later, my app is client server so I need to be concerned about this, but tbh with the situation being what is it (bad) I would be grateful for any forward progress, ie win api specific.

Is using win api function RtlCompareUnicodeString to obvious solution? If so I should really try again with that but tbh I have been taken aback by all of the issues involved with unicode collation and I not clear at all what I should be doing to compare these strings this way anyway.

I learnt of the IBM ICU c++ opensource project, there is a delphi wrapper for it albeit for an older version of ICU. It seems a very comprehensive solution which is platform independant. Surely I cannot be looking at creating a delphi wrapper for this (or updating the existing one) to get a good solution for unicode collation?

I would be extremely glad to hear advice at two levels :-

A) A windows specific non portable solution, I would be glad off that at the moment, forget the client server ramifications! B) A more portable solution which is immune from the various XP/vista/win7 variations of unicode api functions, therefore putting me in good stead for XE2 mac support and future linux support, not to mention the client server complications.

Btw I dont really want to be doing 'make-do' solutions, scanning strings prior to comparison and replacing certain tricky characters etc, which I have read about. I gave the German examplle above, thats just an example, I want to get it working for all (or at least most, far east, russian) languages, I don't want to do workarounds for a specific language or two. I also do not need any advice on the sorting algorithms, they are fine, its just the string comparison bit that's wrong.

I hope I am missing/doing something stupid, this all looks to be a headache.

Thank you.


EDIT, Rudy, here is how I was trying to call RtlCompareUnicodeString. Sorry for the delay I have been having a horrible time with this.

program Project26

{$APPTYPE CONSOLE}

uses
  SysUtils;


var
  a,b:ansistring;

  k,l:string;
  x,y:widestring;
  r:integer;

procedure RtlInitUnicodeString(
  DestinationString:pstring;
  SourceString:pwidechar) stdcall; external 'NTDLL';

function RtlCompareUnicodeString(
  String1:pstring;
  String2:pstring;
  CaseInSensitive:boolean
  ):integer stdcall; external 'NTDLL';


begin

  x:='wef';
  y:='fsd';

  RtlInitUnicodeString(@k, pwidechar(x));
  RtlInitUnicodeString(@l, pwidechar(y));

  r:=RtlCompareUnicodeString(@k,@l,false);

  writeln(r);
  readln;

end.

I realise this is most likely wrong, I am not used to calling api unctions directly, this is my best guess.

About your StringCompareEx api function. That looked really good, but is avail on Vista + only, I'm using XP. StringCompare is on XP, but that's not Unicode!

To recap, the basic task afoot, is to compare two strings, and to do so based on the character sort order specified in the current windows locale.

Can anyone say for sure if ansicomparetext should do this or not? It don't work for me, but others have said it should, and other things i have read suggest it should.

This is what I get with 31 test strings when using AnsiCompareText when in German Locale (space delimited - no strings contain spaces) :-


EDIT 2.

I am still keen to hear if I should expect AnsiCompareText to work using the locale info, as lkessler has said so, and lkessler has also posted about these subjects before and seems have been through this before.

However, following on from Rudy's advice I have also been checking out CompareStringW - which shares the same documentation with CompareString, so it is NOT non-unicode as I have stated earlier.

Even if AnsiCompareText is not going to work, although I think it should, the win32api function CompareStringW should indeed work. Now I have defined my API function, and I can call it, and I get a result, and no error... but i get the same result everytime regardless of the input strings! It returns 1 everytime - which means less than. Here's my code

var
  k,l:string;

function CompareStringW(
  Locale:integer;
  dwCmpFlags:longword;
  lpString1:pstring;
  cchCount1:integer;
  lpString2:pstring;
  cchCount2:integer
  ):integer stdcall; external 'Kernel32.dll';

begin;

  k:='zzz';
  l:='xxx';

  writeln(length(k));
  r:=comparestringw(LOCALE_USER_DEFAULT,0,@k,3,@l,3);

  writeln(r); // result is 1=less than, 2=equal, 3=greater than
  readln;

end;

I feel I am getting somewhere now after much pain. Would be glad to know about AnsiCompareText, and what I am doing wrong with the above CompareStringW api call. Thank you.


EDIT 3

Firstly, I fixed the api call to CompareStringW myself, I was passing in @mystring when I should do PString(mystring). Now it all works correctly.

r:=comparestringw(LOCALE_USER_DEFAULT,0,pstring(k),-1,pstring(l),-1);

Now, you can imagine my dismay when I still got the same sort result as I did right at the beginning...

You may also imagine my EXTREME dismay not to mention simultaneous joy when I realised the sort order IS CORRECT, and IT WAS CORRECT RIGHT BACK IN THE BEGGINING! It make sme sick to say it, but there was never any problem in the first place - this is all down to my lack of German knowledge. I beleived the sort was wrong, since you can see above string start with S, then later they start with ß, then s again and back to ß and so on. Well I can't speak German however I could still clearly see that they was not sorted correctly - my German friend told me ß comes after S and before T... I WAS WRONG! What is happening is that string functions (both AnsiCompareText and winapi CompareTextW) are SUBSTITUTING every 'ß' with 'ss', and every 'ö' with a normal 'o'... so if i take those result above and to a search and replace as described I get...

Looks pretty correct to me! And it always was.

I am extremely grateful for all the advice given, and extremely sorry to have wasted your time like this. Those german ß's got me all confused, there was never nothing wrong with the built in delphi function or anything else. It just looked like there was. I made the mistake of combining them with normal 's' in my test data, any other letter would have not have created this illusion of un-sortedness! The squiggly ß's have made me look a fool! ßs!

Rudy and lkessler we're both especially helpful, ty both, I have to accept lkessler's answer as most correct, sorry Rudy.

Upvotes: 20

Views: 6154

Answers (4)

lkessler
lkessler

Reputation: 20132

Try using CompareStr for case sensitive, or CompareText for case insensitive if you want your sorts exactly the same in any locale.

And use AnsiCompareStr for case sensitive, or AnsiCompareText for case insensitive if you want your sorts to be specific to the locale of the user.

See: How can I get TStringList to sort differently in Delphi for a lot more information on this.

Upvotes: 7

Marjan Venema
Marjan Venema

Reputation: 19346

In Unicode the numeric order of the characters is certainly not the sorting sequence. AnsiCompareText as mentioned by HeartWare does take locale specifics into consideration when comparing characters, but, as you found out, does nothing wrt the sorting order. What you are looking for is called the collation sequence of a language, which specifies the alphabetic sorting order for a language taking diacritics etc into consideration. They were sort of implied in the old Ansi Code pages, though those didn't account for sorting difference between languages using the same character set either.

I checked the D2010 docs. Apart from some TIB* components I didn't find any links. C++ builder does seem to have a compare function that takes collation into account, but that's not much use in Delphi. There you will probably have to use some Windows' API functions directly.

Docs:

The 'Sorting "Collate" all out' article is by Michael Kaplan, someone who has great in-depth knowledge of all things Unicode and all intricacies of various languages. His blog has been invaluable to me when porting from D2006 to D2009.

Upvotes: 3

Rudy Velthuis
Rudy Velthuis

Reputation: 28806

You said you had problems calling Windows API calls yourself. Could you post the code, so people here can see why it failed? It is not as hard as it may seem, but it does require some care. ISTM that RtlCompareUnicodeStrings() is too low level.

I found a few solutions:

Non-portable

You could use the Windows API function CompareStringEx. This will compare using Unicode specific collation types. You can specify how you want this done (see link). It does require wide strings, i.e. PWideChar pointers to them. If you have problems calling it, give a holler and I'll try to add some demo code.

More or less portable

To make this more or less portable, you could write a function that compares two strings and use conditional defines to choose the different comparison APIs for the platform.

Upvotes: 10

HeartWare
HeartWare

Reputation: 8243

Have you tried AnsiCompareText ? Even though it is called "Ansi", I believe it calls on to an OS-specific Unicode-able comparison routine...

It should also make you safe from cross-platform dependencies (provided that Embarcadero supplies a compatible version in the various OS's they target).

I do not know how good the comparison works with the various strange Unicode ways to encode strings, but try it out and let us know the result...

Upvotes: 1

Related Questions