membersound

Reputation: 86747

How to remove all Extended ASCII characters, but not umlauts?

I'd like to remove all extended ASCII characters from an input (reference: http://www.theasciicode.com.ar/extended-ascii-code/letter-a-umlaut-diaeresis-a-umlaut-lowercase-ascii-code-132.html).

I could use CharMatcher.ASCII for that, but I'd also like to keep German umlauts, which are part of the extended character set. How could I achieve this?

Upvotes: 1

Views: 993

Answers (3)

Stephen C

Reputation: 718826

If you want to use the Guava CharMatcher class for this task, you can compose matchers using the and(CharMatcher) and or(CharMatcher) methods, and so on. For example:

CharMatcher asciiPlusUmlauts = 
    CharMatcher.ASCII.or(CharMatcher.anyOf("ÄäÖöÜüß"));

You get the idea? To strip everything else, call asciiPlusUmlauts.retainFrom(input).

Upvotes: 2

kpentchev

Reputation: 3090

Take a look at Lucene's org.apache.lucene.analysis.ASCIIFoldingFilter. It does exactly what you require in an efficient way. It does the folding by checking, for each char, whether it is smaller than \u0080 (i.e. code point 128). If it is, the char is an ASCII character and can be left as it is; otherwise it has to be handled in some way. For more details on Latin characters in Unicode, take a look at http://en.wikipedia.org/wiki/Latin_characters_in_Unicode
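The same per-char check can be sketched in plain Java without Lucene. This is only a minimal illustration of the "smaller than \u0080" test described above, not Lucene's implementation: the class name AsciiUmlautFilter, the method filter and the EXTRA constant are made up for this example, and instead of folding non-ASCII chars it simply drops them unless they are one of the German letters the question wants to keep. The letters are written as Unicode escapes so the result does not depend on the source file's encoding.

```java
final class AsciiUmlautFilter {
    // Hypothetical whitelist for this sketch: A, a, O, o, U, u with diaeresis,
    // plus sharp s. Written as Unicode escapes to stay encoding-independent.
    private static final String EXTRA =
        "\u00C4\u00E4\u00D6\u00F6\u00DC\u00FC\u00DF";

    // Keeps every char below code point 128 (plain ASCII) and every char
    // listed in EXTRA; all other chars are dropped.
    static String filter(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < '\u0080' || EXTRA.indexOf(c) >= 0) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Calling AsciiUmlautFilter.filter(input) returns the cleaned string. Supplementary characters are stored as surrogate pairs, and both halves are above 127, so they are removed as well.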

Upvotes: 0

Bohemian

Reputation: 425033

What about using a whitelist:

input = input.replaceAll("[^\\p{ASCII}ÄäÖöÜüß]", "");

The character class is all ASCII chars plus the umlauts (and I threw in Eszett too).

In action:

System.out.println("a\tb© ½Ü, ß".replaceAll("[^\\p{ASCII}ÄäÖöÜüß]", ""));

Output:

a   b Ü, ß
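If the whitelist is applied many times, the regex can be compiled once and reused. The sketch below wraps the same character class in a helper; the names UmlautWhitelist, NOT_ALLOWED and clean are made up for this example. It also normalizes the input to NFC first, which is an assumption on my part: it only matters if the input may contain decomposed umlauts (a base letter followed by a combining diaeresis), which the character class would otherwise strip down to the bare letter.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class UmlautWhitelist {
    // Same character class as above, compiled once. The German letters are
    // written as Unicode escapes so the source encoding does not matter.
    private static final Pattern NOT_ALLOWED = Pattern.compile(
        "[^\\p{ASCII}\u00C4\u00E4\u00D6\u00F6\u00DC\u00FC\u00DF]");

    static String clean(String input) {
        // Assumption: compose "u" + combining diaeresis into a single
        // precomposed char before filtering, so decomposed umlauts survive.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        return NOT_ALLOWED.matcher(composed).replaceAll("");
    }
}
```

If the input is known to be precomposed already, the normalization step can be dropped and the helper behaves like the one-liner above.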

Upvotes: 1
