Reputation: 86747
I'd like to remove all extended ASCII characters from an input (reference: http://www.theasciicode.com.ar/extended-ascii-code/letter-a-umlaut-diaeresis-a-umlaut-lowercase-ascii-code-132.html).
I could therefore use CharMatcher.ASCII, but I'd also like to keep German umlauts, which are contained within the extended character set.
So, how could I achieve this?
Upvotes: 1
Views: 993
Reputation: 718826
If you want to use the Guava CharMatcher class for this task, you can compose matchers using the and(CharMatcher) and or(CharMatcher) methods, and so on. For example:
CharMatcher asciiPlusUmlauts =
CharMatcher.ASCII.or(CharMatcher.anyOf("ÄäÖöÜüß"));
You get the idea?
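To apply the matcher, something along these lines should work (a rough sketch; retainFrom keeps only the characters the matcher accepts and drops everything else):

import com.google.common.base.CharMatcher;

CharMatcher asciiPlusUmlauts =
        CharMatcher.ASCII.or(CharMatcher.anyOf("ÄäÖöÜüß"));
// © and ½ are removed, plain ASCII and the umlauts are kept
String cleaned = asciiPlusUmlauts.retainFrom("a\tb© ½Ü, ß");
System.out.println(cleaned);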
Upvotes: 2
Reputation: 3090
Take a look at Lucene's org.apache.lucene.analysis.ASCIIFoldingFilter. It does exactly what you require in an efficient way. It does the folding by checking, for each char, whether it is smaller than \u0080 (i.e. code point 128). If it is, it can be left as-is (it is an ASCII character); otherwise it has to be handled in some way. For more details on Latin characters in Unicode, take a look at http://en.wikipedia.org/wiki/Latin_characters_in_Unicode
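For illustration, here is a minimal hand-rolled sketch of that check (this is not Lucene's actual folding code, and the umlaut allow-list is an assumption based on the question):

// Keep plain ASCII plus the German umlauts/eszett, drop everything else
static String keepAsciiAndUmlauts(String input) {
    StringBuilder out = new StringBuilder(input.length());
    for (int i = 0; i < input.length(); i++) {
        char c = input.charAt(i);
        if (c < '\u0080' || "ÄäÖöÜüß".indexOf(c) >= 0) {
            out.append(c); // ASCII (code point < 128) or an allowed umlaut
        }
        // else: other non-ASCII characters are simply dropped here
    }
    return out.toString();
}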
Upvotes: 0
Reputation: 425033
What about using a whitelist:
input = input.replaceAll("[^\\p{ASCII}ÄäÖöÜüß]", "");
The character class is all ASCII chars plus the umlauts (and I threw in the eszett too).
In action:
System.out.println("a\tb© ½Ü, ß".replaceAll("[^\\p{ASCII}ÄäÖöÜüß]", ""));
Output:
a b Ü, ß
Upvotes: 1