Krystian Marek

Reputation: 364

Transliterate German umlauts using icu4j into their DIN 5007-2 alternatives

I would like to transliterate the German umlaut characters

Ü ü ö ä Ä Ö

into their DIN 5007-2 alternatives

ä → ae
ö → oe
ü → ue
Ä → Ae
Ö → Oe
Ü → Ue
ß → ss (or SZ)

like in this case:

https://german.stackexchange.com/questions/4992/conversion-table-for-diacritics-e-g-%C3%BC-%E2%86%92-ue

The most relevant use case I found was: https://github.com/elastic/elasticsearch-analysis-icu/blob/master/src/test/java/org/elasticsearch/index/analysis/SimpleIcuCollationTokenFilterTests.java

where, on line 208, they define:

String DIN5007_2_tailorings =
            "& ae , a\u0308 & AE , A\u0308"+
            "& oe , o\u0308 & OE , O\u0308"+
            "& ue , u\u0308 & UE , u\u0308";

I would like to avoid complex Java code, such as defining custom tailorings and everything that goes with them. I want to keep the code as simple as possible, because I have to use it inside a ColdFusion application.

I experimented a little with

var instance = Transliterator.getInstance("Latin-ASCII");

and

var instance = Transliterator.getInstance("any-NFD; [:nonspacing mark:] any-remove; any-NFC");

and their variants. They all produce the same result:

 writeDump(instance.transliterate('Häuser Bäume Höfe Gärten daß Ü ü ö ä Ä Ö ß '));

 Hauser Baume Hofe Garten dass U u o a A O ss 

If possible, I would like to stick to the .getInstance() method. So the question is: what ID string for .getInstance() would transliterate umlauts into their DIN 5007-2 equivalents?

Upvotes: 3

Views: 2445

Answers (3)

Sephiroth

Reputation: 666

Unfortunately, "de-ASCII" does not transform the "€" symbol to "EUR" the way iconv does. To achieve this, you have to create a Transliterator instance from a set of rules. The code sample below shows how to create such a variant of "de-ASCII" that also transforms "€" to "EUR". The rules are based on those of "de-ASCII", as returned by Transliterator.getInstance("de-ASCII").toRules(true), plus an added rule for the euro symbol.

        final var rules = """
                          [\\u00E4{a\\u0308}] > ae;
                          [\\u00F6{o\\u0308}] > oe;
                          [\\u00FC{u\\u0308}] > ue;
                          {[\\u00C4{A\\u0308}]}[:Lowercase:] > Ae;
                          {[\\u00D6{O\\u0308}]}[:Lowercase:] > Oe;
                          {[\\u00DC{U\\u0308}]}[:Lowercase:] > Ue;
                          [\\u00C4{A\\u0308}] > AE;
                          [\\u00D6{O\\u0308}] > OE;
                          [\\u00DC{U\\u0308}] > UE;
                          [\\u20AC] > EUR;
                          ::Any-ASCII;""";
        final var instance = Transliterator.createFromRules("de_EUR-ASCII", rules, Transliterator.FORWARD);

Upvotes: 1

Hui Gui

Reputation: 131

An update: there is now a simple solution using the built-in "de-ASCII" ID:

import com.ibm.icu.text.Transliterator; // from ICU4J
// txt holds the input string
Transliterator transliterator = Transliterator.getInstance("de-ASCII");
String umlautReplaced = transliterator.transliterate(txt);

Upvotes: 4

MED

Reputation: 21

You can create a Transliterator from a rules string, such as:

ä → ae;
ö → oe;
ü → ue;
Ä → Ae;
Ö → Oe;
Ü → Ue;
ß → ss;

You can see these rules in action at:

http://unicode.org/cldr/utility/transform.jsp?a=%C3%A4+%E2%86%92+ae%3B%0D%0A%C3%B6+%E2%86%92+oe%3B%0D%0A%C3%BC+%E2%86%92+ue%3B%0D%0A%C3%84+%E2%86%92+Ae%3B%0D%0A%C3%96+%E2%86%92+Oe%3B%0D%0A%C3%9C+%E2%86%92+Ue%3B%0D%0A%C3%9F+%E2%86%92+ss%3B&b=H%C3%A4user+B%C3%A4ume+H%C3%B6fe+G%C3%A4rten+da%C3%9F+%C3%9C+%C3%BC+%C3%B6+%C3%A4+%C3%84+%C3%96+%C3%9F+

However, you may want a slightly more sophisticated approach, because these rules map HÄUSER to HAeUSER.

The rules allow for context, so you can do the following:

$beforeLower = [[:Mn:][:Me:]]* [:Lowercase:] ;

ä → ae;
ö → oe;
ü → ue;

Ä } $beforeLower → Ae;
Ö } $beforeLower → Oe;
Ü } $beforeLower → Ue;

Ä → AE;
Ö → OE;
Ü → UE;
ß → ss;

which gives the following:

ä ö ü Ä Ö Ü Ät Öt Üt ß → ae oe ue AE OE UE Aet Oet Uet ss
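To make the context idea concrete, here is a minimal plain-Java sketch of the same logic, without ICU4J. It is not the ICU rule engine, only a hand-written approximation: an uppercase umlaut becomes "Ae"/"Oe"/"Ue" when the next letter (after skipping any combining marks) is lowercase, and "AE"/"OE"/"UE" otherwise. The names UmlautFolder, fold and followedByLowercase are made up for this illustration, and Character.isLowerCase is assumed to be close enough to ICU's [:Lowercase:] set for German text.

```java
/**
 * Plain-Java approximation of the context-sensitive rules above.
 * UmlautFolder, fold and followedByLowercase are hypothetical names,
 * not part of ICU4J or the JDK.
 */
final class UmlautFolder {

    private UmlautFolder() {}

    /** Replaces German umlauts and sharp s with their ASCII digraphs. */
    static String fold(String input) {
        // Compose "a + combining diaeresis" into the single character first,
        // so only the precomposed forms need to be handled below.
        String s = java.text.Normalizer.normalize(input, java.text.Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\u00E4': out.append("ae"); break;   // a umlaut
                case '\u00F6': out.append("oe"); break;   // o umlaut
                case '\u00FC': out.append("ue"); break;   // u umlaut
                case '\u00DF': out.append("ss"); break;   // sharp s
                case '\u00C4': out.append(followedByLowercase(s, i) ? "Ae" : "AE"); break;
                case '\u00D6': out.append(followedByLowercase(s, i) ? "Oe" : "OE"); break;
                case '\u00DC': out.append(followedByLowercase(s, i) ? "Ue" : "UE"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /** Mirrors $beforeLower: optional combining marks, then a lowercase letter. */
    private static boolean followedByLowercase(String s, int index) {
        int j = index + 1;
        while (j < s.length()) {
            int type = Character.getType(s.charAt(j));
            if (type != Character.NON_SPACING_MARK && type != Character.ENCLOSING_MARK) {
                break;
            }
            j++;
        }
        return j < s.length() && Character.isLowerCase(s.charAt(j));
    }
}
```

Calling UmlautFolder.fold on the sample line above should reproduce the same output as the rules, and HÄUSER should come out as HAEUSER rather than HAeUSER. If ICU4J is available, passing the rules to Transliterator.createFromRules is the more general route.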

Upvotes: 2
