Reputation: 327
I need to encode a URL component. The component can contain special characters such as "?", "#", and "=", as well as Chinese characters.
Which character encoding should I use: UTF-8, UTF-16, or UTF-32? And why?
Upvotes: 4
Views: 7012
Reputation: 16604
A reference from an HTML point of view.
The HTML4 specification, section Non-ASCII characters in URI attribute values, states (my emphasis):
We recommend that user agents adopt the following convention for handling non-ASCII characters in such cases:
- Represent each character in UTF-8 (see [RFC2279]) as one or more bytes.
- Escape these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value).
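To make the two steps concrete, here is a minimal Java sketch of that convention. The class name `Html4Escape` and the helper `escapeNonAscii` are made up for illustration (they are not part of any specification or library), and I am assuming ASCII characters are passed through untouched, since the quoted text only covers non-ASCII ones.

```java
import java.nio.charset.StandardCharsets;

public class Html4Escape {
    // Hypothetical helper: leaves ASCII as-is and escapes every other
    // character as the %HH form of each of its UTF-8 bytes.
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else {
                // Step 1: represent the character in UTF-8 as one or more bytes.
                byte[] bytes = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                // Step 2: convert each byte to %HH.
                for (byte b : bytes) {
                    out.append(String.format("%%%02X", b & 0xFF));
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("/wiki/维")); // prints /wiki/%E7%BB%B4
    }
}
```

Note that this does nothing about reserved ASCII characters such as "?" or "#"; it only illustrates the non-ASCII part.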
Similarly, in the HTML5 specification, the section Selecting a form submission encoding basically says that UTF-8 should be used if no accept-charset attribute is specified.
On the other hand, I found nothing that states UTF-8 must be used. Some older software uses ISO-8859-1 in particular. For example, Apache Tomcat before version 8 has ISO-8859-1 as the default value for its URIEncoding setting.
Upvotes: 2
Reputation: 121702
I suppose you mean percent encoding here.
RFC 3986, section 2.5 is pretty clear about this (emphasis mine):
When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2".
Therefore, this should be UTF-8.
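As a rough illustration of the quoted rule, the sketch below encodes a string as UTF-8 octets and percent-encodes every octet that is not in the RFC 3986 unreserved set (ALPHA, DIGIT, "-", ".", "_", "~"). The class and method names are hypothetical, and this is only a sketch for a single component, not a full URI builder.

```java
import java.nio.charset.StandardCharsets;

public class ComponentEncoder {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // RFC 3986 unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~"
    static boolean isUnreserved(int b) {
        return (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z')
                || (b >= '0' && b <= '9') || "-._~".indexOf(b) >= 0;
    }

    // Hypothetical helper: UTF-8 first, then percent-encode non-unreserved octets.
    static String encodeComponent(String s) {
        StringBuilder out = new StringBuilder();
        for (byte raw : s.getBytes(StandardCharsets.UTF_8)) {
            int b = raw & 0xFF;
            if (isUnreserved(b)) {
                out.append((char) b);
            } else {
                out.append('%').append(HEX[b >> 4]).append(HEX[b & 0xF]);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeComponent("A"));       // A
        System.out.println(encodeComponent("À"));       // %C3%80
        System.out.println(encodeComponent("ア"));      // %E3%82%A2
        System.out.println(encodeComponent("a?b#c=d")); // a%3Fb%23c%3Dd
    }
}
```

The first three outputs match the examples given in the RFC text; the last one shows the "?", "#", and "=" characters from the question being escaped as well.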
Also, beware of URLEncoder.encode(); while the recommendation to use it is often repeated, the fact is that it is not suitable for URI encoding. Quoting the javadoc of the class itself:
This class contains static methods for converting a String to the application/x-www-form-urlencoded MIME format
which is not what URI encoding uses. (In case you are wondering, application/x-www-form-urlencoded is what is used in HTTP POST data.) What you want to use is a URI template instead. See for instance here.
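To show the difference in practice, here is a small sketch comparing URLEncoder with the multi-argument java.net.URI constructor. Note that java.net.URI is a standard-library stand-in I picked for illustration, not the URI template library meant above; the class name `FormVsUri`, the helper names, and the host example.com are all assumptions.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

public class FormVsUri {
    // application/x-www-form-urlencoded: a space becomes '+', not %20.
    static String formEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // The multi-argument URI constructor quotes illegal characters per
    // component; toASCIIString() then percent-encodes non-ASCII as UTF-8.
    static String uriOf(String path, String query) {
        try {
            return new URI("http", "example.com", path, query, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(formEncode("a b?c"));        // a+b%3Fc
        System.out.println(uriOf("/维 也", "q=a b"));
        // http://example.com/%E7%BB%B4%20%E4%B9%9F?q=a%20b
    }
}
```

The "+" in the first output is the form-encoding convention for a space; in a URI path it would be read as a literal plus sign, which is why the two mechanisms are not interchangeable.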
Upvotes: 5
Reputation: 567
Go for UTF-8. You can also achieve the same thing with URLEncoder.encode(string, encoding).
In addition, you can refer to this blog, which tries to encode some Chinese characters such as '维也纳恩斯特哈佩尔球场'.
Upvotes: 0
Reputation: 870
Encode your URL to escape special characters. There are several websites that can do this for you. E.g. http://www.url-encode-decode.com/
Upvotes: -2
Reputation: 64
UTF-8 (an encoding of Unicode) is the default character encoding in HTML5, as it can represent almost all symbols and characters.
Upvotes: 0