Mihail Minkov

Reputation: 2633

Usage of unicode characters in URL

The basis of this question is that many Latin-script languages, and many non-Latin ones too, have letters that, from what I've seen, were not really usable in URLs until recently and nearly always ended up as a long run of URL-encoded characters.

But recently I've seen several sites using native letters in URLs (except in the domain).

Something like this, for example, using Spanish accented letters:

https://www.example.com/esta-es-una-frase-en-español
https://www.example.com/cómo-usar-acentos-y-la-letra-ñ-en-urls

Also, I've seen URLs like

https://www.example.com/урл-на-български

From what I remember, not so long ago one had to either encode accented characters or convert them to non-accented ones.

But now you can use this type of URL in the browser without any issue, and the letters appear as they should (not URL-encoded).

Is it safe to assume that now my URLs can handle these characters?

Also, does this make any difference to how Google indexes the URLs?

Upvotes: 3

Views: 2243

Answers (2)

Omar N Shamali

Reputation: 773

What Google says (reference):

Recommended: Use UTF-8 encoding as necessary.

And:

Not recommended: Using non-ASCII characters in the URL

Google wants you to encode the URLs in the anchors, but you are free to use any characters in the title of the anchor.

As you have mentioned Wikipedia: all the URLs in its anchors are in fact encoded, and both browsers and servers handle the encoded URLs. Attached example from Wikipedia:

[Image: example from Wikipedia]

Upvotes: 0

Remy Lebeau

Reputation: 595369

URIs/URLs, as defined by RFC 3986 "Uniform Resource Identifier (URI): Generic Syntax", do not allow unencoded non-ASCII characters. Such characters must be charset-encoded (usually to UTF-8), and the resulting octets are then percent-encoded. If a browser is given a URL with unencoded Unicode characters in it, the browser will typically percent-encode it properly behind the scenes when transmitting it to a web server. You can verify this with your browser's built-in debugger (if it has one) or an HTTP/S sniffer.
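To illustrate the two steps (UTF-8 encoding, then percent-encoding of each octet), here is a minimal Python sketch using the standard library's `urllib.parse.quote`. The variable names are only for illustration, and the path is taken from the question's example; a browser's exact behaviour may differ in details.

```python
from urllib.parse import quote

# Path from the question, containing the non-ASCII letters "ó" and "ñ".
path = "/cómo-usar-acentos-y-la-letra-ñ-en-urls"

# Step 1: the characters are encoded to bytes (UTF-8 here).
octets_for_n_tilde = "ñ".encode("utf-8")  # two bytes: 0xC3 0xB1

# Step 2: each non-ASCII byte is written as %XX.
# quote() does both steps; "/" is kept as a path separator.
encoded_path = quote(path, safe="/", encoding="utf-8")

print(octets_for_n_tilde.hex())  # c3b1
print(encoded_path)              # /c%C3%B3mo-usar-acentos-y-la-letra-%C3%B1-en-urls
```

This is roughly what ends up in the request line sent to the server, even though the address bar keeps showing the readable form.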

IRIs, as defined by RFC 3987 "Internationalized Resource Identifiers (IRIs)", do allow unencoded Unicode characters. IRIs are not in widespread use yet, but they can maintain backwards compatibility by mapping to and from encoded URIs/URLs. It is possible that your browser is treating the content of the address bar as an IRI and converting it to/from a URI/URL internally as needed.
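The mapping between the readable form and the encoded form can be sketched roughly as follows. This is a simplified illustration, not a full RFC 3987 implementation: `iri_to_uri` and `uri_to_iri` are hypothetical helper names, the host is left untouched (the question excludes the domain, which would need IDNA handling), and the reverse direction naively decodes every escape, including reserved characters, which a real converter would avoid.

```python
from urllib.parse import urlsplit, urlunsplit, quote, unquote


def iri_to_uri(iri: str) -> str:
    """Percent-encode non-ASCII characters in path, query and fragment."""
    parts = urlsplit(iri)
    # "%" is kept safe so already-encoded sequences are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


def uri_to_iri(uri: str) -> str:
    """Decode percent-encoded UTF-8 back into readable characters (naive)."""
    parts = urlsplit(uri)
    return urlunsplit((
        parts.scheme,
        parts.netloc,
        unquote(parts.path),
        unquote(parts.query),
        unquote(parts.fragment),
    ))


iri = "https://www.example.com/esta-es-una-frase-en-español"
uri = iri_to_uri(iri)
print(uri)              # https://www.example.com/esta-es-una-frase-en-espa%C3%B1ol
print(uri_to_iri(uri))  # https://www.example.com/esta-es-una-frase-en-español
```

The round trip shows why both forms can coexist: the readable one is for display, the encoded one is what travels over the wire.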

Upvotes: 2
