Reputation: 3159
I am using the following file to build a HashMap whose keys are the accented Unicode characters and whose values are the plain characters they should map to: https://github.com/lmjabreu/solr-conftemplate/blob/master/mapping-ISOLatin1Accent.txt
So far I have written the following code to remove accents from a string:
import java.util.HashMap;

public class ACCENTS {
    public static void main(String[] args) {
        // this is the hashmap that stores the mappings of the characters to their ASCII equivalents
        HashMap<Character, Character> characterMappings = new HashMap<>();
        characterMappings.put('\u00C0', 'A');
        characterMappings.put('\u00C1', 'A');
        characterMappings.put('\u00C2', 'A');
        characterMappings.put('\u00C3', 'A');
        characterMappings.put('\u00C4', 'A');
        characterMappings.put('\u00C5', 'A');
        characterMappings.put('\u00C7', 'C');
        characterMappings.put('\u00C8', 'E');
        characterMappings.put('\u00C9', 'E');
        characterMappings.put('\u00CA', 'E');
        characterMappings.put('\u00CB', 'E');
        characterMappings.put('\u00CC', 'I');
        characterMappings.put('\u00CD', 'I');
        characterMappings.put('\u00CE', 'I');
        characterMappings.put('\u00CF', 'I');
        characterMappings.put('\u00D0', 'D');
        characterMappings.put('\u00D1', 'N');
        characterMappings.put('\u00D2', 'O');
        characterMappings.put('\u00D3', 'O');
        characterMappings.put('\u00D4', 'O');
        characterMappings.put('\u00D5', 'O');
        characterMappings.put('\u00D6', 'O');
        characterMappings.put('\u00D8', 'O');
        characterMappings.put('\u00D9', 'U');
        characterMappings.put('\u00DA', 'U');
        characterMappings.put('\u00DB', 'U');
        characterMappings.put('\u00DC', 'U');
        characterMappings.put('\u00DD', 'Y');
        characterMappings.put('\u0178', 'Y');
        characterMappings.put('\u00E0', 'a');
        characterMappings.put('\u00E1', 'a');
        characterMappings.put('\u00E2', 'a');
        characterMappings.put('\u00E3', 'a');
        characterMappings.put('\u00E4', 'a');
        characterMappings.put('\u00E5', 'a');
        characterMappings.put('\u00E7', 'c');
        characterMappings.put('\u00E8', 'e');
        characterMappings.put('\u00E9', 'e');
        characterMappings.put('\u00EA', 'e');
        characterMappings.put('\u00EB', 'e');
        characterMappings.put('\u00EC', 'i');
        characterMappings.put('\u00ED', 'i');
        characterMappings.put('\u00EE', 'i');
        characterMappings.put('\u00EF', 'i');
        characterMappings.put('\u00F0', 'd');
        characterMappings.put('\u00F1', 'n');
        characterMappings.put('\u00F2', 'o');
        characterMappings.put('\u00F3', 'o');
        characterMappings.put('\u00F4', 'o');
        characterMappings.put('\u00F5', 'o');
        characterMappings.put('\u00F6', 'o');
        characterMappings.put('\u00F8', 'o');
        characterMappings.put('\u00F9', 'u');
        characterMappings.put('\u00FA', 'u');
        characterMappings.put('\u00FB', 'u');
        characterMappings.put('\u00FC', 'u');
        characterMappings.put('\u00FD', 'y');
        characterMappings.put('\u00FF', 'y');
        String token = "nа̀ра";
        String newString = "";
        for (int i = 0; i < token.length(); ++i) {
            if (characterMappings.containsKey(token.charAt(i)))
                newString += characterMappings.get(token.charAt(i));
            else
                newString += token.charAt(i);
        }
        System.out.println(newString);
    }
}
The expected result should have been "napa", but it turns out no conversion is performed at all. What could be the cause of this? I am not able to find one.
Upvotes: 1
Views: 3157
Reputation: 1053
You ran into one of the uglier 'features' of Java: one visible character may be represented by two (or even three) chars.
In fact, token has a length of 5 chars. The а̀ is a combination of two chars and can only be represented as a String. This is why
characterMappings.put('а̀', 'y'); // (the accent can't be displayed correctly in code mode, try it yourself)
won't compile.
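To see for yourself what token actually contains, a small diagnostic like the following helps (my sketch, not from the original answer; the token is written here with explicit escape sequences so nothing is ambiguous):

```java
public class InspectToken {
    public static void main(String[] args) {
        // The token from the question: 'n', CYRILLIC SMALL LETTER A,
        // COMBINING GRAVE ACCENT, CYRILLIC SMALL LETTER ER, CYRILLIC SMALL LETTER A
        String token = "n\u0430\u0300\u0440\u0430";
        System.out.println(token.length()); // prints 5
        for (int i = 0; i < token.length(); i++) {
            // print the numeric value of each char in U+XXXX notation
            System.out.printf("U+%04X%n", (int) token.charAt(i));
        }
    }
}
```

Note that besides the combining accent (U+0300), the letters а (U+0430) and р (U+0440) are Cyrillic look-alikes of Latin a and p, which is another reason none of the Latin-1 mappings in the question's map ever match.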
Here is some more explanation.
In my humble opinion, String is one of the worst classes in Java, especially if you use 'non-standard' characters.
To solve your problem I would suggest changing your map to Map<String, String> or Map<String, Character>. This way you can map your 'characters', and as a neat side effect your code becomes more readable if you drop the escaped Unicode characters.
For more information, google for "high surrogate" or "code point". Code points are the valid (= displayable) char sequences, whose count - as mentioned before - does not necessarily correspond with the number of chars in a String. This is necessary because a Java char is just 2 bytes wide: too small for all Unicode characters, but big enough most of the time (= as long as you use standard Latin characters).
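The 2-byte limit is easy to demonstrate with a character outside the Basic Multilingual Plane, e.g. the treble clef (U+1D11E), which a String must store as a surrogate pair (a small illustration, not part of the original answer):

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // MUSICAL SYMBOL G CLEF, U+1D11E, does not fit into one 16-bit char,
        // so String stores it as a high/low surrogate pair
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                          // 2 chars
        System.out.println(clef.codePointCount(0, clef.length()));  // 1 code point
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true
    }
}
```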
Edit:
Even with a Map<String, String>, your code won't work, because you still loop over chars, and no single Java char will match your special Unicode character. This might help, though it is not guaranteed to work in all circumstances (Java strings are nasty, after all):
// needs: import java.util.HashMap; import java.util.Map.Entry;
HashMap<String, String> characterMappings = new HashMap<>();
characterMappings.put("а̀", "a");

String token = "nа̀ра";
for (Entry<String, String> e : characterMappings.entrySet()) {
    token = token.replaceAll(e.getKey(), e.getValue());
}
System.out.println(token);
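One caveat worth adding (my note, not part of the original answer): replaceAll interprets its first argument as a regular expression, so a key containing characters like . or * would misbehave. String.replace substitutes literally and is the safer choice for this kind of lookup table:

```java
public class LiteralReplace {
    public static void main(String[] args) {
        // Cyrillic а (U+0430) + combining grave (U+0300), mapped to plain "a";
        // replace(CharSequence, CharSequence) substitutes literally,
        // with none of replaceAll's regex pitfalls
        String token = "n\u0430\u0300\u0440\u0430";
        token = token.replace("\u0430\u0300", "a");
        System.out.println(token);
    }
}
```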
Edit 2
Since posting code as a comment sucks:
// needs: import java.text.Normalizer; import java.nio.charset.StandardCharsets;
String s = "brûlée";
String s1 = Normalizer.normalize(s, Normalizer.Form.NFKD);
String regex = "[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+";
// StandardCharsets.US_ASCII avoids the checked UnsupportedEncodingException
// that getBytes("ascii") would force you to handle
String s2 = new String(s1.replaceAll(regex, "").getBytes(StandardCharsets.US_ASCII),
        StandardCharsets.US_ASCII);
System.out.println(s2);
This works for me with everything I have tried so far. Still, @Scheintod deserves the credit. Source found here
Best regards
sam
Upvotes: 1
Reputation: 8105
Not sure why you want to use a HashMap. But if you just want to remove the diacritics, perhaps this helps:
// needs: import java.text.Normalizer;
String s = "nа̀ра";
s = Normalizer.normalize(s, Normalizer.Form.NFD);
s = s.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
System.out.println(s);
--> napa
(If you insist on using the HashMap, you should still have a look at the Normalizer class, because it can work in the other direction, too.)
Taken from this article: http://blog.smartkey.co.uk/2009/10/how-to-strip-accents-from-strings-using-java-6/
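The "other direction" mentioned above is composition: the NFC form merges a base letter and a combining mark back into a single precomposed char where one exists (a small sketch, not from the linked article):

```java
import java.text.Normalizer;

public class ComposeDemo {
    public static void main(String[] args) {
        // 'e' followed by COMBINING ACUTE ACCENT: two chars
        String decomposed = "e\u0301";
        // NFC composes them into the single precomposed char U+00E9 (é)
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(decomposed.length()); // 2
        System.out.println(composed.length());   // 1
    }
}
```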
Upvotes: 5