Reputation: 244
I have some link resources with none latin characters like åäö These are usually user uploaded files
The problem is that i am not successfull in encoding them
using filename.encodeAsURL seems to not encode it the right way
For example the character ö is turned into o%CC%88 Testing to type the same thing in firefox and copy the contents gives %C3%B6
What are the difference between these encodings and what should i use to get the correct encoding??
Upvotes: 1
Views: 1104
Reputation: 39580
Both encodings are correct. You are actually seeing the encoding of two different strings.
The key here is noticing the o
at the beginning of the string:
o%CC%88
is the letter o
followed by Unicode Character Combining Diaeresis, which combines with the previous character when rendered.
%C3%B6
is Unicode Character Latin Small O With Diaeresis.
What you are seeing is that in the first case, the string entered is something like these two characters: o
¨
, which are actually rendered as ö
.
In the second case, it's the actual character ö
.
My guess is you are seeing the difference between two different inputs.
Update based on below discussion: If you are dynamically processing Unicode characters, and you do not have control over the input methods, you can try to normalize the Unicode, using java.text.Normalizer (Java 1.6 or newer).
Normalizing attempts to ensure that all characters are consistently represented, so that accented characters are always represented by a combined character or always by the character+combining mark.
Rough example:
String.metaClass.normalizeUnicode = {
return java.text.Normalizer.normalize(delegate, java.text.Normalizer.Form.NFC)
}
input = input.normalizeUnicode()
There are four forms of normalization. I picked the one that seems to be best for your case based on the description of how they work, but you may prefer to try the other ones and see what works most consistently.
All that being said, if you are try to representing Unicode characters in a URL, and they are not being loaded and processed by the code directly, it's probably best to avoid using non-latin characters altogether. Not only does this have the benefit of consistently, but also significantly shorter and more legible URLs. boo.pdf
is a lot easier to read than bo%CC%88o.pdf
.
Upvotes: 2