dev2d
dev2d

Reputation: 4262

Why java.net.URLEncoder gives different result for same string?

On the webapp server when I try encoding "médicaux_Jérôme.txt" using java.net.URLEncoder it gives following string:

me%CC%81dicaux_Je%CC%81ro%CC%82me.txt

While on my backend server when I try encoding the same string it gives following:

m%C3%A9dicaux_J%C3%A9r%C3%B4me.txt

Can someone help me understanding the different output for the same input? Also how can I get standardized output each time I decode the same string?

Upvotes: 3

Views: 1617

Answers (1)

acdcjunior
acdcjunior

Reputation: 135762

The outcome depends on the platform, if you don't specify it.

See the java.net.URLEncoder javadocs:

encode(String s)

Deprecated

The resulting string may vary depending on the platform's default encoding. Instead, use the encode(String,String) method to specify the encoding.

So, use the suggested method and specify the encoding:

String urlEncodedString = URLEncoder.encode(stringToBeUrlEncoded, "UTF-8")

About different representations for the same string, if you specified "UTF-8":

The two URL encoded strings you gave in the question, although differently encoded, represent the same unencoded value, so there is nothing inherently wrong there. By writing both in a decode tool, we can verify that they are the same.

This is due, as we are seeing in this case, to the fact that there are multiple ways to URL encode the same string, specially if they have acute accents (due to the combining acute accent, precisely what happens in your case).

To your case, specifically, the first string encoded é as e + ´ (latin small letter e + combining acute accent) resulting in e%CC%81. The second encoded é directly to %C3%A9 (latin small letter e with acute - two % because in UTF-8 it takes two bytes).

Again, there is no problem with either representation. Both are forms of Unicode Normalization. It is known that Mac OS Xs tend to encode using the combining acute accent; in the end, it is a matter of preference of the encoder. In your case, there must be different JREs or, if that file name was user generated, then the user may have used a different OS (or tool) that generated that encoding.

Upvotes: 4

Related Questions