wojtek

Reputation: 503

Removing accents from String

Recently I found a very helpful method in the StringUtils library:

StringUtils.stripAccents(String s)

I found it really helpful for removing special characters and converting them to an ASCII "equivalent", for instance ç becomes c.

Now I am working for a German customer who needs exactly this, but only for non-German characters: any umlauts should stay untouched. I realised that stripAccents won't be useful in that case.

Does anyone have experience with this? Are there any useful tools, libraries, classes or maybe regular expressions? I tried to write a class that parses and replaces such characters, but building such a map for all languages can be very difficult...

Any suggestions appreciated...

Upvotes: 6

Views: 10295

Answers (3)

eis

Reputation: 53462

My gut feeling tells me the easiest way to do this would be to just list allowed characters and strip accents from everything else. This would be something like

import java.util.regex.*;
import java.text.*;

public class Replacement {
    public static void main(String args[]) {
        String from = "aoeåöäìé";
        String result = stripAccentsFromNonGermanCharacters(from);
        
        System.out.println("Result: " + result);
    }

    private static String patternContainingAllValidGermanCharacters =
                                            "a-zA-Z0-9äÄöÖéÉüÜß";
    private static Pattern nonGermanCharactersPattern =
        Pattern.compile("([^" + patternContainingAllValidGermanCharacters + "])");

    public static String stripAccentsFromNonGermanCharacters(
           String from) {
        return stripAccentsFromCharactersMatching(
            from, nonGermanCharactersPattern);
    }

    public static String stripAccentsFromCharactersMatching(
        String target, Pattern myPattern) {

        StringBuffer myStringBuffer = new StringBuffer();
        Matcher myMatcher = myPattern.matcher(target);
        while (myMatcher.find()) {
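            // quoteReplacement keeps a matched '$' or '\' from being read
            // as a group reference or escape by appendReplacement
            myMatcher.appendReplacement(myStringBuffer,
                Matcher.quoteReplacement(stripAccents(myMatcher.group(1))));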
        }
        myMatcher.appendTail(myStringBuffer);

        return myStringBuffer.toString();
    }


    // pretty much the same thing as StringUtils.stripAccents(String s)
    // used here so I can demonstrate the code without StringUtils dependency
    public static String stripAccents(String text) {
        return Normalizer.normalize(text,
            Normalizer.Form.NFD)
           .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }
}

(I realize the pattern probably doesn't contain all the characters needed, so add whatever is missing.)
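A variant of the same allow-list idea without the regex matcher, as a minimal sketch: it assumes the characters to keep are just ä, Ä, ö, Ö, ü, Ü and ß, and the class and method names (KeepGermanLetters, stripAccentsExceptGerman) are made up for illustration. It composes the input first, so an umlaut that arrives as a base letter plus a combining diaeresis is still recognised as a German character. The keep-list is written with Unicode escapes so the result does not depend on the source file's encoding.

```java
import java.text.Normalizer;

class KeepGermanLetters {
    // Assumed keep-list: a-umlaut, A-umlaut, o-umlaut, O-umlaut,
    // u-umlaut, U-umlaut and sharp s, written as Unicode escapes.
    private static final String KEEP =
        "\u00e4\u00c4\u00f6\u00d6\u00fc\u00dc\u00df";

    static String stripAccentsExceptGerman(String input) {
        // Compose first, so "u" + combining diaeresis becomes one code point.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(composed.length());
        composed.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (KEEP.contains(ch)) {
                // German character: copy it through unchanged.
                out.append(ch);
            } else {
                // Everything else: decompose and drop the combining marks.
                out.append(Normalizer.normalize(ch, Normalizer.Form.NFD)
                    .replaceAll("\\p{InCombiningDiacriticalMarks}+", ""));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(
            stripAccentsExceptGerman("M\u00fcller fa\u00e7ade cr\u00e8me"));
    }
}
```

For the input in main ("Müller façade crème") this should give "Müller facade creme". Like the regex version, it only removes combining marks, so letters without a decomposition (ø, ł and similar) pass through unchanged.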

Upvotes: 2

Paul Vargas

Reputation: 42020

It's best to build a custom function, like the following. If you want to avoid converting a particular character, remove it from UNICODE and remove its replacement at the same position in PLAIN_ASCII, so that the two constants stay aligned.

private static final String UNICODE =
        "ÀàÈèÌìÒòÙùÁáÉéÍíÓóÚúÝýÂâÊêÎîÔôÛûŶŷÃãÕõÑñÄäËëÏïÖöÜüŸÿÅåÇçŐőŰű";
private static final String PLAIN_ASCII =
        "AaEeIiOoUuAaEeIiOoUuYyAaEeIiOoUuYyAaOoNnAaEeIiOoUuYyAaCcOoUu";

public static String toAsciiString(String str) {
    if (str == null) {
        return null;
    }
    StringBuilder sb = new StringBuilder();
    for (int index = 0; index < str.length(); index++) {
        char c = str.charAt(index);
        int pos = UNICODE.indexOf(c);
        if (pos > -1)
            sb.append(PLAIN_ASCII.charAt(pos));
        else {
            sb.append(c);
        }
    }
    return sb.toString();
}

public static void main(String[] args) {
    System.out.println(toAsciiString("Höchstalemannisch"));
}
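To keep the German characters with this approach, the umlaut pairs have to come out of the table. A minimal sketch of one way to do that without editing the two constants by hand: build a map from the two aligned strings, then drop the entries for the characters that should survive. The names here (AsciiFolding, buildTable, fold) are hypothetical, and the table in main is a shortened stand-in for the full constants above.

```java
import java.util.HashMap;
import java.util.Map;

class AsciiFolding {
    // Pairs accented.charAt(i) with plain.charAt(i), then removes every
    // character listed in keep so that it is never replaced.
    static Map<Character, Character> buildTable(
            String accented, String plain, String keep) {
        if (accented.length() != plain.length()) {
            throw new IllegalArgumentException(
                "accented and plain must have the same length");
        }
        Map<Character, Character> table = new HashMap<>();
        for (int i = 0; i < accented.length(); i++) {
            table.put(accented.charAt(i), plain.charAt(i));
        }
        for (char c : keep.toCharArray()) {
            table.remove(c);
        }
        return table;
    }

    // Replaces every character that has a table entry; others are copied.
    static String fold(String str, Map<Character, Character> table) {
        if (str == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            char mapped = table.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Shortened table: c-cedilla, e-acute, e-grave, o/u/a-umlaut.
        String accented = "\u00e7\u00e9\u00e8\u00f6\u00fc\u00e4";
        String plain = "ceeoua";
        String german = "\u00e4\u00f6\u00fc";
        Map<Character, Character> table =
            buildTable(accented, plain, german);
        System.out.println(fold("H\u00f6chstalemannisch fa\u00e7ade", table));
    }
}
```

The length check is there because the two strings only work while they stay aligned; deleting a character from one and forgetting the other would silently shift every later replacement.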

Upvotes: 4

Nitesh Verma

Reputation: 1815

This might give you a workaround: here you can detect the language and extract only the text in that language.

EDIT: You can pass the raw string as input and set the language detection to German; it will then detect the German characters and discard the rest.

Upvotes: 0
