Reputation: 364
I would like to be able to transliterate German umlaut characters
Ü ü ö ä Ä Ö
into their DIN 5007-2 alternatives
ä → ae
ö → oe
ü → ue
Ä → Ae
Ö → Oe
Ü → Ue
ß → ss (or SZ)
The most relevant example I found is https://github.com/elastic/elasticsearch-analysis-icu/blob/master/src/test/java/org/elasticsearch/index/analysis/SimpleIcuCollationTokenFilterTests.java
where, on line 208, they define:
String DIN5007_2_tailorings =
"& ae , a\u0308 & AE , A\u0308"+
"& oe , o\u0308 & OE , O\u0308"+
"& ue , u\u0308 & UE , u\u0308";
I would like to avoid complex Java code, such as defining custom tailorings and everything that entails. I want to keep the code as simple as possible because I have to use it inside a ColdFusion application.
I experimented a little with
var instance = Transliterator.getInstance("Latin-ASCII");
and
var instance = Transliterator.getInstance("any-NFD; [:nonspacing mark:] any-remove; any-NFC");
and variants of these, but they all simply strip the diacritics instead of expanding them:
writeDump(instance.transliterate('Häuser Bäume Höfe Gärten daß Ü ü ö ä Ä Ö ß '));
Hauser Baume Hofe Garten dass U u o a A O ss
If possible, I would like to stick to the .getInstance() method. So the question is: which ID string passed to .getInstance() would transliterate umlauts into their DIN 5007-2 equivalents?
Upvotes: 3
Views: 2445
Reputation: 666
Unfortunately, "de-ASCII" does not transform the "€" symbol to "EUR", as iconv would. To achieve this, you have to create a Transliterator instance from a set of rules. The code sample below shows how to create such a variant of "de-ASCII" that also transforms "€" to "EUR". The rules are based on those of "de-ASCII", as returned by Transliterator.getInstance("de-ASCII").toRules(true), plus an added rule for the euro symbol.
final var rules = """
[\\u00E4{a\\u0308}] > ae;
[\\u00F6{o\\u0308}] > oe;
[\\u00FC{u\\u0308}] > ue;
{[\\u00C4{A\\u0308}]}[:Lowercase:] > Ae;
{[\\u00D6{O\\u0308}]}[:Lowercase:] > Oe;
{[\\u00DC{U\\u0308}]}[:Lowercase:] > Ue;
[\\u00C4{A\\u0308}] > AE;
[\\u00D6{O\\u0308}] > OE;
[\\u00DC{U\\u0308}] > UE;
[\\u20AC] > EUR;
::Any-ASCII;""";
final var instance = Transliterator.createFromRules("de_EUR-ASCII", rules, Transliterator.FORWARD);
Upvotes: 1
Reputation: 131
An update on this: there is now a simple solution using the "de-ASCII" ID:
Transliterator transliterator = Transliterator.getInstance("de-ASCII");
String umlautReplaced = transliterator.transliterate(txt);
Upvotes: 4
Reputation: 21
You can create a transliterator from a rules string, like:
ä → ae;
ö → oe;
ü → ue;
Ä → Ae;
Ö → Oe;
Ü → Ue;
ß → ss;
You can see this on:
However, you may want a slightly more sophisticated approach, because your rules will map HÄUSER to HAeUSER.
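To make the problem concrete, here is a minimal plain-Java sketch (no ICU involved) of what a context-free mapping does. The class and method names are made up for illustration, and String.replace only stands in for the simple rules; it is not how ICU applies them internally.

```java
final class NaiveUmlautDemo {
    // Hypothetical helper: replaces every umlaut with a fixed string,
    // without looking at the surrounding characters.
    static String naive(String s) {
        return s.replace("\u00E4", "ae")   // a-umlaut
                .replace("\u00F6", "oe")   // o-umlaut
                .replace("\u00FC", "ue")   // u-umlaut
                .replace("\u00C4", "Ae")   // capital A-umlaut
                .replace("\u00D6", "Oe")   // capital O-umlaut
                .replace("\u00DC", "Ue")   // capital U-umlaut
                .replace("\u00DF", "ss");  // sharp s
    }

    public static void main(String[] args) {
        // Mixed-case input works as expected.
        System.out.println(naive("H\u00E4user"));  // Haeuser
        // All-caps input ends up with a stray lowercase letter.
        System.out.println(naive("H\u00C4USER"));  // HAeUSER
    }
}
```

The second call shows the mixed-case result that a context-free rule set cannot avoid.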
The rules allow for context, so you can do the following:
$beforeLower = [[:Mn:][:Me:]]* [:Lowercase:] ;
ä → ae;
ö → oe;
ü → ue;
Ä } $beforeLower → Ae;
Ö } $beforeLower → Oe;
Ü } $beforeLower → Ue;
Ä → AE;
Ö → OE;
Ü → UE;
ß → ss;
which gives the following:
ä ö ü Ä Ö Ü Ät Öt Üt ß → ae oe ue AE OE UE Aet Oet Uet ss
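For comparison, the same context idea can be sketched in plain Java without ICU. This is only an illustration of the rule logic, under some assumptions: the class name Din5007Sketch and its transliterate method are hypothetical, the input is normalized to NFC first so that decomposed forms (a base letter followed by U+0308) are handled, and the look-ahead only checks the immediately following character, so it does not skip combining marks the way $beforeLower does.

```java
import java.text.Normalizer;

final class Din5007Sketch {
    // Hypothetical helper: expands German umlauts, choosing "Ae" or "AE"
    // for capitals depending on whether the next character is lowercase.
    static String transliterate(String input) {
        // Compose base letter + combining diaeresis into single code points.
        String s = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // Simplified context check: only the directly following char.
            boolean beforeLower = i + 1 < s.length()
                    && Character.isLowerCase(s.charAt(i + 1));
            switch (c) {
                case '\u00E4': out.append("ae"); break;
                case '\u00F6': out.append("oe"); break;
                case '\u00FC': out.append("ue"); break;
                case '\u00C4': out.append(beforeLower ? "Ae" : "AE"); break;
                case '\u00D6': out.append(beforeLower ? "Oe" : "OE"); break;
                case '\u00DC': out.append(beforeLower ? "Ue" : "UE"); break;
                case '\u00DF': out.append("ss"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(transliterate("H\u00C4USER"));
        System.out.println(transliterate("\u00C4t \u00D6t \u00DCt \u00DF"));
    }
}
```

With ICU available, the rule-based transliterator above remains the better choice, since it handles the combining-mark context that this sketch leaves out.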
Upvotes: 2