Reputation: 5655
Ok, so I feel like this question has been asked many times, but I am not able to find an answer. I am comparing two different files that were generated by two different programs. Both programs generate the files from the same db queries. I am running into the following differences:
s1 =
Samsung - Mobile USB Chargers
vs.
s2 =
Samsung \u2013 Mobile USB Chargers
How do I convert s2 to s1 or, even better, how do I compare the two without getting a difference? Someone somewhere on the internet suggested using the StringUtils class from Apache Commons Lang, but I couldn't find anything useful in it.
Upvotes: 1
Views: 3798
Reputation: 108879
You could fold all the characters with the Dash_Punctuation property.
This code will print true:
boolean equal = "Samsung \u2013 Mobile USB Chargers"
    .replaceAll("\\p{Pd}", "-")
    .equals("Samsung - Mobile USB Chargers");
System.out.println(equal);
Note that this will apply to all characters with that property (like 〰 U+3030 WAVY DASH). A comprehensive list of characters with the Dash_Punctuation (Pd) property is in UnicodeData.txt. Java 6 supports Unicode 4. See chapter 6 of the Unicode Standard for a discussion of punctuation.
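If the comparison is needed in more than one place, the same folding can be wrapped in a small helper. The following is only a sketch: the class name DashFolding and the methods foldDashes and equalsIgnoringDashStyle are made up for illustration, and it assumes that treating every Pd character as a plain hyphen-minus is acceptable for your data.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not part of any library.
final class DashFolding {

    // Matches every character in the Unicode Dash_Punctuation (Pd) category,
    // including the ASCII hyphen-minus itself.
    private static final Pattern DASHES = Pattern.compile("\\p{Pd}");

    // Replaces each Pd character with an ASCII hyphen-minus.
    static String foldDashes(String s) {
        return DASHES.matcher(s).replaceAll("-");
    }

    // Compares two strings after folding the dashes in both of them.
    static boolean equalsIgnoringDashStyle(String a, String b) {
        return foldDashes(a).equals(foldDashes(b));
    }

    public static void main(String[] args) {
        String s1 = "Samsung - Mobile USB Chargers";
        String s2 = "Samsung \u2013 Mobile USB Chargers";
        System.out.println(s1.equals(s2));                   // prints false
        System.out.println(equalsIgnoringDashStyle(s1, s2)); // prints true
    }
}
```

Folding both sides means it does not matter which of the two files contains the typographic dash.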
Upvotes: 2
Reputation: 59463
The program that generated the first string is writing the file in ASCII, using a character substitution fallback mechanism. The second is writing the file in Unicode.
These could be compared by making a copy of the second file in ASCII using the same fallback mechanism.
The best solution would be to modify the first program so that it also uses Unicode.
(It is possible that the second file was using something other than Unicode, since some other character sets include the en dash. If so, then the best solution is to write both files in Unicode, if possible.)
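If changing the first program is not an option, the ASCII-copy approach could be sketched roughly as follows. The actual fallback rules of the first program are not known, so the mapping here is an assumption: dash characters become a hyphen-minus, and any other non-ASCII character becomes '?', which is what Java's own String.getBytes(StandardCharsets.US_ASCII) substitutes for unmappable characters. The class name AsciiFallback and the method toAscii are hypothetical.

```java
// Hypothetical sketch of an ASCII fallback; the real rules used by the
// first program are unknown and may differ from this mapping.
final class AsciiFallback {

    static String toAscii(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            if (cp < 128) {
                // Already ASCII: keep as-is.
                out.append((char) cp);
            } else if (Character.getType(cp) == Character.DASH_PUNCTUATION) {
                // Assumed fallback: any non-ASCII dash becomes a hyphen-minus.
                out.append('-');
            } else {
                // Assumed fallback for everything else.
                out.append('?');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String s2 = "Samsung \u2013 Mobile USB Chargers";
        System.out.println(toAscii(s2)); // prints Samsung - Mobile USB Chargers
    }
}
```

Each line of the second file would be passed through toAscii before being compared with the corresponding line of the first file. If the first program substitutes other characters differently (for example, curly quotes to straight quotes), the mapping would need to be extended to match.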
Upvotes: 1