Edi

Reputation: 327

What character set should be used for URL encoding?

I need to encode a URL component. The component can contain special characters such as "?", "#", and "=", as well as Chinese characters.

Which character set should I use: UTF-8, UTF-16, or UTF-32? And why?

Upvotes: 4

Views: 7012

Answers (5)

holmis83

Reputation: 16604

A reference from an HTML point of view.

The HTML4 specification, section Non-ASCII characters in URI attribute values, states (my emphasis):

We recommend that user agents adopt the following convention for handling non-ASCII characters in such cases:

  1. Represent each character in UTF-8 (see [RFC2279]) as one or more bytes.
  2. Escape these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value).
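As a rough illustration of the two steps quoted above, here is a minimal Java sketch. The class name `PercentEncoder` and the choice of which characters to leave unescaped (the RFC 3986 unreserved set) are assumptions made for this example; the HTML4 text itself only describes the UTF-8 and %HH steps.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of any library.
public class PercentEncoder {
    // Assumption: leave the RFC 3986 "unreserved" characters as-is and escape everything else.
    private static final String UNRESERVED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

    public static String encode(String input) {
        StringBuilder out = new StringBuilder();
        // Step 1: represent each character in UTF-8 as one or more bytes.
        for (byte b : input.getBytes(StandardCharsets.UTF_8)) {
            int value = b & 0xFF;
            if (value < 0x80 && UNRESERVED.indexOf((char) value) >= 0) {
                out.append((char) value);
            } else {
                // Step 2: escape the byte as %HH (two uppercase hex digits).
                out.append(String.format("%%%02X", value));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "a?b=" followed by U+7EF4, the first character of a Chinese example string.
        System.out.println(encode("a?b=\u7EF4")); // prints a%3Fb%3D%E7%BB%B4
    }
}
```

Swapping `StandardCharsets.UTF_8` for UTF-16 or UTF-32 would produce different bytes, and a receiver following the convention above would not decode them correctly.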

Similarly, the HTML5 specification, in the section Selecting a form submission encoding, basically says that UTF-8 should be used if no accept-charset attribute is specified.

On the other hand, I found nothing that states UTF-8 must be used. Some older software uses ISO-8859-1 instead. For example, Apache Tomcat before version 8 has ISO-8859-1 as the default value for its URIEncoding setting.
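To see why the receiving side's charset matters, the sketch below decodes the same percent-encoded text once as UTF-8 and once as ISO-8859-1. It uses `java.net.URLDecoder` from the standard library as a stand-in for a server's decoding step; it does not reproduce Tomcat's own code, and the class name `CharsetMismatchDemo` is made up for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; URLDecoder stands in for whatever the server does.
public class CharsetMismatchDemo {
    public static String decodeAs(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String encoded = "%C3%80"; // UTF-8 bytes of U+00C0 (LATIN CAPITAL LETTER A WITH GRAVE)

        // Decoded as UTF-8, the two bytes form a single character.
        System.out.println(decodeAs(encoded, "UTF-8").length());      // prints 1

        // Decoded as ISO-8859-1, each byte becomes its own character (mojibake).
        System.out.println(decodeAs(encoded, "ISO-8859-1").length()); // prints 2
    }
}
```

So even if the client encodes with UTF-8, a server configured for ISO-8859-1 will hand the application garbled text.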

Upvotes: 2

fge

Reputation: 121702

I suppose you mean percent encoding here.

RFC 3986, section 2.5 is pretty clear about this (emphasis mine):

When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2".

Therefore, this should be UTF-8.

Also, beware of URLEncoder.encode(): although it is recommended again and again, it is not suitable for URI encoding. Quoting the javadoc of the class itself:

This class contains static methods for converting a String to the application/x-www-form-urlencoded MIME format

which is not what URI encoding uses (in case you are wondering, application/x-www-form-urlencoded is what is used in HTTP POST data). What you want to use is a URI template instead. See for instance here.
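The difference is easy to see with a space in the input. The sketch below compares `URLEncoder.encode()` with the multi-argument `java.net.URI` constructor; note that `java.net.URI` is used here only as a readily available standard-library stand-in, not as the URI template approach recommended above. The class name `FormVsUriEncoding` and the host `example.com` are placeholders.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

// Hypothetical demo class contrasting form encoding with URI path quoting.
public class FormVsUriEncoding {
    // application/x-www-form-urlencoded, as produced by URLEncoder.
    public static String formEncode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Builds a URI from components; the constructor quotes illegal characters
    // and toASCIIString() percent-encodes non-ASCII characters as UTF-8.
    public static String uriWithPath(String path) {
        try {
            return new URI("http", "example.com", path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(formEncode("a b"));       // prints a+b
        System.out.println(uriWithPath("/a b"));     // prints http://example.com/a%20b
        System.out.println(uriWithPath("/\u7EF4"));  // prints http://example.com/%E7%BB%B4
    }
}
```

A "+" in a URI path is a literal plus sign, so the form-encoded output would be read back as "a+b" rather than "a b".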

Upvotes: 5

Chirag Visavadiya

Reputation: 567

Go for UTF-8. You can also achieve the same thing with URLEncoder.encode(string, encoding).

In addition, you can refer to this blog, which tries to encode some Chinese characters such as '维也纳恩斯特哈佩尔球场'.
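A minimal sketch of that call, assuming the value is a query-string parameter (where form encoding applies) and that the receiver also decodes with UTF-8. The class name `QueryValueRoundTrip` is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class: encode a value with UTF-8 and decode it again.
public class QueryValueRoundTrip {
    public static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Special characters from the question plus U+7EF4, the first
        // character of the Chinese example string.
        String original = "?#=\u7EF4";
        String encoded = encode(original);
        System.out.println(encoded);                           // prints %3F%23%3D%E7%BB%B4
        System.out.println(decode(encoded).equals(original));  // prints true
    }
}
```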

Upvotes: 0

AuthenticReplica

Reputation: 870

Encode your URL to escape special characters. There are several websites that can do this for you. E.g. http://www.url-encode-decode.com/

Upvotes: -2

wblades

Reputation: 64

UTF-8 (an encoding of Unicode) is the default character encoding in HTML5, as Unicode encompasses almost all symbols and characters.

Upvotes: 0
