Pawel Krakowiak

Reputation: 10100

Non-Latin characters in URLs - is it better to encode them or replace with their Latin "counterparts"?

We're implementing a blog for a site which supports six different languages and five of them have non-Latin characters in their alphabets. We are not sure whether we should have them encoded (that is what we're doing at the moment)

Létání s potravinami: Co je dovoleno? becomes l%c3%a9t%c3%a1n%c3%ad-s-potravinami-co-je-dovoleno and the browser displays it as létání-s-potravinami-co-je-dovoleno.

or if we should replace them with their Latin "counterparts" (similar looking letters)

Létání s potravinami: Co je dovoleno? becomes letani-s-potravinami-co-je-dovoleno.

I can't find a definitive answer as to what's better from an SEO perspective. Search engine optimization is very important for us. Which approach would you suggest?
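For concreteness, this is roughly how the two slug variants can be produced (a minimal Python sketch; the helper functions are only illustrative, not our actual blog code):

    import re
    import unicodedata
    import urllib.parse

    def slug_encoded(title):
        # Keep the non-Latin letters and percent-encode the UTF-8 bytes,
        # as in l%c3%a9t%c3%a1n%c3%ad-s-potravinami-co-je-dovoleno.
        words = re.findall(r"\w+", title.lower())
        return urllib.parse.quote("-".join(words))

    def slug_transliterated(title):
        # Decompose accented letters (NFKD) and drop the combining marks,
        # so "Létání" becomes "letani".
        decomposed = unicodedata.normalize("NFKD", title.lower())
        ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
        return "-".join(re.findall(r"[a-z0-9]+", ascii_only))

    title = "Létání s potravinami: Co je dovoleno?"
    print(slug_encoded(title))         # l%C3%A9t%C3%A1n%C3%AD-s-potravinami-co-je-dovoleno
    print(slug_transliterated(title))  # letani-s-potravinami-co-je-dovoleno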

Upvotes: 3

Views: 3854

Answers (7)

Quuxplusone

Reputation: 27342

This question is from 14 years ago. Today we have many blogs and sites using non-Latin characters. So what is the approach used today? How do sites and services avoid hitting any URL length limits that can easily be reached when every letter is %-encoded?

I'm no authority, but here are two points that should be "obvious" but haven't been mentioned by any answer so far. And two caveats.

(1) You can totally put UTF-8 in the path part of your URLs. For example: https://en.wiktionary.org/wiki/létání . It's probably not standard, but all browsers (and all further-upstream machinery) obviously have to handle that kind of input, and so they do. How they handle it (for example Chrome seems to just %-encode automatically before making the request) seems out of scope for this question; my point is just that you don't have to choose either of the two suboptimal approaches (manual %-encoding or English transliteration) — it'll be OK to just do the natural thing.

(2) There are well-defined Unicode algorithms for string collation and comparison. You can see them in action in your browser right now (and in the sketch after point (4)): Ctrl+F and search for "letani" on this page. Notice that both "létání" and "letani" are highlighted as matches. Now search for "létání"; again, notice that both "létání" and "letani" are highlighted. So whoever's processing your text for spidering purposes will certainly have all the tools at their disposal to connect someone searching for "letani" with your page on "létání", or vice versa...

(3) ...but, I admit, I don't know what I'd do about accent homographs if I were running a search engine. Someone searching for information on "congres" might not want to see search results about the "congrès," and vice versa. It seems prudent at least to use accents correctly within your body text and headlines. I'm not sure anyone would be indexing on the URL path itself either way.

(4) Finally, since about 2022, I doubt any of this matters anymore. Who are you "SEO"'ing for — Google? They don't even do real "search results" anymore; it's just a bunch of LLM-generated "cards" encouraging the reader to buy something or click through to one of their social-media properties. If it's for a specific old-fashioned search engine, like, I dunno, Kagi, they might publish information on what they care about and/or how they work. (But I don't see Kagi doing that anywhere right now.)
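To make point (2) a bit more concrete, here's a naive accent-folding comparison in Python. It's only a sketch of the idea; real collation (UAX #10, or ICU in practice) is considerably more involved:

    import unicodedata

    def fold(text):
        # Naive search key: case-fold, decompose (NFKD), drop combining marks.
        decomposed = unicodedata.normalize("NFKD", text.casefold())
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(fold("létání") == fold("letani"))    # True: both fold to "letani"
    print(fold("congrès") == fold("congres"))  # True: which is exactly the
                                               # homograph problem in (3)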

Upvotes: 1

David Thornley

Reputation: 57066

Another issue is that there are Unicode code points whose glyphs look very much alike in most fonts, which is absolutely ideal for phishers. Stick to ASCII and the glyphs are visibly different when the characters are.

Upvotes: 0

austin cheney

Reputation:

In accordance with the URI specification, RFC 3986, only 7-bit ASCII characters are allowed, and the characters the specification reserves for special purposes must be properly percent-encoded. If you want to represent other characters, you should be using an IRI, per RFC 3987. Keep in mind, however, that HTTP is not compatible with IRIs.

When in doubt, RTFM.
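For illustration only, a rough Python sketch in the spirit of RFC 3987's IRI-to-URI mapping: percent-encode the non-ASCII octets of each component. It's simplified and skips the IDNA handling of the host:

    from urllib.parse import quote, urlsplit, urlunsplit

    def iri_to_uri(iri):
        # Percent-encode the non-ASCII octets of each component.
        # Simplified: a real mapping would also IDNA-encode the host.
        parts = urlsplit(iri)
        return urlunsplit((
            parts.scheme,
            parts.netloc,
            quote(parts.path, safe="/%"),
            quote(parts.query, safe="=&%"),
            quote(parts.fragment, safe="%"),
        ))

    print(iri_to_uri("https://example.com/blog/létání-s-potravinami"))
    # https://example.com/blog/l%C3%A9t%C3%A1n%C3%AD-s-potravinami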

Upvotes: 0

dzhi

Reputation: 1674

Pawel, first of all, you should decide whether you're going to optimize for global Google (google.com) or the Polish one.

Upvotes: 0

S.Lott

Reputation: 392012

"what's better from SEO perspective"

Who's your audience? Americans who think all those extra letters are a mistake?

Or folks who read (and search) for "non-ASCII" letters because those non-ASCII letters are part of their language?

SEO is a bad thing to chase. Complete, correct, consistent and usable is what you want to build first.

Upvotes: 2

Adam Kiss

Reputation: 11859

Most of the time, search engines deal with Latin counterparts well, although sometimes results for e.g. "létání" and "letani" differ slightly.

So, in terms of SEO, almost no harm is done - once your site has good content, good markup and all that other stuff, it won't suffer from having Latin URLs.

You don't always know what combination of system, browser and plugins your users have, so make the URLs as easy as possible - most websites stick to standard Latin in URLs, because non-Latin symbols can choke anything from the server through the browser to any plugin, and break the user's experience.

And I can't stress this enough: users before SEO!

Upvotes: 5

Amine ABDALKHALKI

Reputation: 445

Well, I suggest you replace them with their Latin counterparts because it's more user-friendly and your website will be accessible from every computer (keyboards change from one computer to another, but all of them have Latin letters). From an SEO perspective, I don't think it's going to be a problem.

Upvotes: 0
