Reputation: 2633
This question stems from the fact that many Latin-based languages, and many non-Latin ones too, have letters that, from what I've seen, were not really usable in URLs until recently and nearly always ended up as a long run of URL-encoded characters.
But recently I've seen several sites using native letters in URLs (everywhere except the domain).
For example, using Spanish accented letters:
https://www.example.com/esta-es-una-frase-en-español
https://www.example.com/cómo-usar-acentos-y-la-letra-ñ-en-urls
Also, I've seen URLs like
https://www.example.com/урл-на-български
As far as I remember, not so long ago you had to either encode accented characters or convert them to their non-accented equivalents.
But now you can type this kind of URL into the browser without any problem, and the letters appear as they should (not URL-encoded).
Is it safe to assume that now my URLs can handle these characters?
Also, does it make any difference to how Google indexes the URLs?
Upvotes: 3
Views: 2243
Reputation: 773
What Google says (reference):
Recommended: Use UTF-8 encoding as necessary.
And:
Not recommended: Using non-ASCII characters in the URL
Google wants you to encode the URLs in the anchors' href attributes, but you are free to use any characters in the anchor's title.
As for Wikipedia, which you mentioned, all URLs in its anchors are in fact encoded; browsers and servers handle the encoded URLs transparently. Attached example from Wikipedia:
Upvotes: 0
Reputation: 595369
URIs/URLs, as defined by RFC 3986 "Uniform Resource Identifier (URI): Generic Syntax", do not allow unencoded non-ASCII characters. Such characters must first be charset-encoded (usually to UTF-8), and the resulting octets are then percent-encoded. If a browser is given a URL with unencoded Unicode characters in it, it will typically percent-encode the URL behind the scenes before transmitting it to a web server. You can verify this with your browser's built-in debugger (if it has one) or an HTTP/S sniffer.
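To make the two steps concrete, here is a minimal Python sketch. The variable names are just for illustration, and the manual version only handles non-ASCII bytes (reserved ASCII characters such as spaces would also need encoding in a real URL). It is compared with the standard library's `urllib.parse.quote`, which uses UTF-8 by default:

```python
from urllib.parse import quote

# One path segment taken from the example URLs in the question.
segment = "español"

# Step 1: charset-encode the text to UTF-8 octets.
utf8_bytes = segment.encode("utf-8")

# Step 2: percent-encode every octet outside the ASCII range.
# (Sketch only: reserved ASCII characters are left untouched here.)
manual = "".join(
    chr(b) if b < 0x80 else f"%{b:02X}"
    for b in utf8_bytes
)

# urllib.parse.quote performs both steps in one call.
encoded = quote(segment)

print(manual)   # espa%C3%B1ol
print(encoded)  # espa%C3%B1ol
```

The single letter `ñ` becomes two octets in UTF-8 (`C3 B1`), which is why it turns into two percent-escapes on the wire.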
IRIs, as defined by RFC 3987 "Internationalized Resource Identifiers (IRIs)", do allow unencoded Unicode characters. IRIs are not in widespread use yet; however, they can maintain backwards compatibility by mapping to and from encoded URIs/URLs. It is possible that your browser is treating the content of the address bar as an IRI and converting it to and from a URI/URL internally as needed.
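A rough sketch of that mapping in Python, assuming only the path contains non-ASCII characters. The helper names `iri_to_uri` and `uri_to_iri` are hypothetical, not part of any library, and this is not a full RFC 3987 implementation: a non-ASCII host name would need IDNA/Punycode handling instead, and the query string is passed through unchanged.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Percent-encode the path of an IRI (hypothetical helper, path only)."""
    parts = urlsplit(iri)
    # Keep "/" as a separator and "%" so existing escapes are not re-encoded.
    path = quote(parts.path, safe="/%")
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


def uri_to_iri(uri: str) -> str:
    """Decode the path for display (hypothetical helper).

    Note: unquote decodes every escape, including ones like %2F,
    so this is only suitable for display, not for routing.
    """
    parts = urlsplit(uri)
    path = unquote(parts.path)
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


iri = "https://www.example.com/урл-на-български"
uri = iri_to_uri(iri)

print(uri)              # ASCII-only form sent over the wire
print(uri_to_iri(uri))  # readable form, as shown in an address bar
```

This mirrors what a browser appears to do: it shows the readable form to the user and sends the ASCII-only form to the server.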
Upvotes: 2