DanielGibbs

Reputation: 10190

URL decoding Japanese characters etc. in Java

I have a servlet that receives some POST data. Because this data is x-www-form-urlencoded, a string such as サボテン would be encoded to &#12469;&#12508;&#12486;&#12531;.

How would I unencode this string back to the correct characters? I have tried using URLDecoder.decode("encoded string", "UTF-8"); but it doesn't make a difference.

The reason I would like to unencode them is that, before I display this data on a webpage, I escape & to &amp;, and at the moment this also escapes the &s in the encoded string, so the characters do not show up properly.

Upvotes: 3

Views: 5286

Answers (4)

irreputable

Reputation: 45453

This is a feature/bug of browsers. If a web page is in a limited charset, say ASCII, and users type some chars outside that charset into a form field, browsers will send these chars in the form of &#xxxx;

It can be a problem because if users actually type &#xxxx; it will be sent as is. So the server has no way to distinguish the two cases.

The best way is to use a charset that covers all characters, like UTF-8, so browsers won't do this trick.
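
The ambiguity described in this answer can be reproduced with a small simulation. This is only a sketch: `simulateSubmit` is a hypothetical helper that imitates the browser fallback by replacing any code point the page charset cannot encode with a decimal reference. It is not what any particular browser actually runs.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class FormFallbackSketch {
    // Hypothetical helper: replaces every code point that the page charset
    // cannot encode with a decimal reference of the form &#NNNN;.
    public static String simulateSubmit(String typed, Charset pageCharset) {
        CharsetEncoder encoder = pageCharset.newEncoder();
        StringBuilder sent = new StringBuilder();
        typed.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (encoder.canEncode(ch)) {
                sent.append(ch);                          // representable: send as is
            } else {
                sent.append("&#").append(cp).append(';'); // fallback reference
            }
        });
        return sent.toString();
    }

    public static void main(String[] args) {
        // The real character and the literally typed text give the same result.
        System.out.println(simulateSubmit("サ", StandardCharsets.US_ASCII));        // &#12469;
        System.out.println(simulateSubmit("&#12469;", StandardCharsets.US_ASCII));  // &#12469;
        // With a UTF-8 page the character is representable and no reference is produced.
        System.out.println(simulateSubmit("サ", StandardCharsets.UTF_8).contains("&#")); // false
    }
}
```

Both ASCII cases produce identical output, which is why the server cannot tell them apart, and why a UTF-8 page avoids the issue.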

Upvotes: 1

Byron Whitlock

Reputation: 53921

How about a regular expression?

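import java.util.regex.Matcher;
import java.util.regex.Pattern;
// Escape only those ampersands that are not already the start of "&amp;".
Pattern pattern = Pattern.compile("&(?!amp;)");
Matcher matcher = pattern.matcher(inputStr);
String output = matcher.replaceAll("&amp;");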

Upvotes: 0

BalusC

Reputation: 1109645

Those are not URL encodings. It would have looked like %E3%82%B5%E3%83%9C%E3%83%86%E3%83%B3. Those are decimal HTML/XML entities. To unescape HTML/XML entities, use Apache Commons Lang StringEscapeUtils.
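
To make the difference concrete, here is a minimal sketch using only the JDK. `decodeDecimalEntities` is a hypothetical helper written for this illustration; it handles decimal references only and is not a substitute for the Commons Lang class mentioned above, which also covers named and hexadecimal entities.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericEntitySketch {
    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d{1,7});");

    // Hypothetical helper: decodes decimal references such as &#12469;.
    // Named entities (&amp;) and hex references (&#x30B5;) are left untouched.
    public static String decodeDecimalEntities(String input) {
        Matcher m = DECIMAL_ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group(); // leave out-of-range references as they are
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String entities = "&#12469;&#12508;&#12486;&#12531;";
        // URL decoding finds no %XX escapes here, so the string comes back unchanged.
        System.out.println(URLDecoder.decode(entities, "UTF-8").equals(entities)); // true
        // A real URL encoding decodes to the same characters as the entities do.
        String fromUrl = URLDecoder.decode("%E3%82%B5%E3%83%9C%E3%83%86%E3%83%B3", "UTF-8");
        System.out.println(fromUrl.equals(decodeDecimalEntities(entities))); // true
    }
}
```

The first check shows why `URLDecoder.decode` appeared to do nothing in the question: there is nothing percent-encoded in the input for it to decode.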


Update as per the comments: you will get question marks when the response encoding is not UTF-8. If you're using JSP, just add the following line to the top of the page:

<%@ page pageEncoding="UTF-8" %>
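
The question marks themselves are easy to reproduce outside a servlet container. The sketch below assumes, purely for illustration, that the response is written as ISO-8859-1; `roundTrip` is a hypothetical helper that encodes a string to bytes and reads it back with the same charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuestionMarkSketch {
    // Hypothetical helper: encode the text with the given charset, then decode
    // the resulting bytes with that same charset.
    public static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset); // unmappable chars become the charset's replacement byte
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // ISO-8859-1 cannot represent these characters, so each one is lost.
        System.out.println(roundTrip("サボテン", StandardCharsets.ISO_8859_1)); // ????
        // UTF-8 can represent them, so the text survives the round trip.
        System.out.println(roundTrip("サボテン", StandardCharsets.UTF_8).equals("サボテン")); // true
    }
}
```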

For more detail, see the solutions about halfway through this article. I would prefer using UTF-8 all the way over fiddling with regexps, since regexps don't prepare you for world domination.

Upvotes: 5

rfeak

Reputation: 8214

Just a wild guess, but are you using Tomcat?

If so, make sure you have set up the Connector in Tomcat with a URIEncoding of UTF-8. Search the web for that and you will find a ton of hits, such as:

How to get UTF-8 working in Java webapps?

Upvotes: 0
