Why does the default deprecated java.net.URLEncoder.encode work but not when I specify a charset?

Question

I'm parsing some image links on Wikipedia and came across this one at http://en.wikipedia.org/wiki/Special:Export/Diego_Forl%C3%A1n

When I use the deprecated one-argument URLEncoder.encode, accented characters are encoded correctly, but when I pass "UTF-8" as the second argument, the result is wrong. The text on Wikipedia is UTF-8, AFAIK.

Diego+Forl%C3%A1n+vs+the+Netherlands.jpg is correct whereas Diego+Forl%E2%88%9A%C2%B0n+vs+the+Netherlands.jpg is incorrect.

scala> first
res24: String = Diego Forlán vs the Netherlands.jpg

scala> java.net.URLEncoder.encode(first, "UTF-8")
res25: java.lang.String = Diego+Forl%E2%88%9A%C2%B0n+vs+the+Netherlands.jpg

scala> java.net.URLEncoder.encode(first)
<console>:33: warning: method encode in object URLEncoder is deprecated: see corresponding Javadoc for more information.
              java.net.URLEncoder.encode(first)
                                  ^
res26: java.lang.String = Diego+Forl%C3%A1n+vs+the+Netherlands.jpg

McDowell · Accepted Answer

I would guess that first is already corrupt and is only rendering correctly due to a transcoding bug hidden by your console configuration.

You can confirm this by emitting the UTF-16 code units in the string:

for (c <- first.toCharArray()) { print("\\u%04x".format(c.toInt)) }

There is probably a more elegant way to write that.
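One possible tidier form, as a minimal sketch: the object name CodeUnits and the method name dump are made up for illustration. It returns the code units as space-separated four-digit hex values instead of printing them.

```scala
object CodeUnits {
  // Format every UTF-16 code unit of s as four lowercase hex digits.
  // Hypothetical helper, not part of any library.
  def dump(s: String): String =
    s.map(c => "%04x".format(c.toInt)).mkString(" ")

  def main(args: Array[String]): Unit = {
    println(dump("\u00e1"))       // 00e1
    println(dump("\u221a\u00b0")) // 221a 00b0
  }
}
```

In the REPL, CodeUnits.dump(first) should show 00e1 for an intact string and 221a 00b0 for a corrupted one.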

If the code point is encoded correctly, it will be:

U+00e1      á       \u00e1

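I expect that somewhere UTF-8 encoded data is being decoded using a MacRoman decoder. That would also explain why the deprecated one-argument encode appears to work: it uses the platform default encoding, and if that default is MacRoman, √ and ° are written back out as the bytes c3 and a1, which happen to be the UTF-8 encoding of á.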

codepoint   glyph   escaped    x-MacRoman     info
=======================================================================
U+221a      √       \u221a     c3,            MATHEMATICAL_OPERATORS, MATH_SYMBOL
U+00b0      °       \u00b0     a1,            LATIN_1_SUPPLEMENT, OTHER_SYMBOL
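
The suspected round trip can be reproduced with a short sketch. It assumes the JVM provides the x-MacRoman charset (it is one of the extended charsets, so it may be missing from a trimmed-down runtime); the object and value names are made up for illustration.

```scala
import java.net.URLEncoder
import java.nio.charset.{Charset, StandardCharsets}

object MacRomanMojibake {
  // Assumption: x-MacRoman is available in this JVM.
  val macRoman: Charset = Charset.forName("x-MacRoman")

  // The intended text, with a real U+00E1.
  val correct: String = "Forl\u00e1n"

  // Simulate the suspected bug: UTF-8 bytes decoded with a MacRoman decoder.
  // The two bytes c3 a1 become the two characters U+221A and U+00B0.
  val corrupt: String =
    new String(correct.getBytes(StandardCharsets.UTF_8), macRoman)

  // Undo the damage by reversing the wrong decoding step.
  val repaired: String =
    new String(corrupt.getBytes(macRoman), StandardCharsets.UTF_8)

  def main(args: Array[String]): Unit = {
    println(URLEncoder.encode(corrupt, "UTF-8"))      // Forl%E2%88%9A%C2%B0n
    println(URLEncoder.encode(corrupt, "x-MacRoman")) // Forl%C3%A1n
    println(URLEncoder.encode(correct, "UTF-8"))      // Forl%C3%A1n
    println(repaired == correct)                      // true
  }
}
```

Encoding the corrupt string with x-MacRoman yields the same output as encoding the correct string with UTF-8, which is why the result only looks right when the platform default happens to be MacRoman. The real fix is to decode the downloaded data as UTF-8 in the first place.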
