Alex

Reputation: 6655

Converting UTF8 text for use in a URL

I'm developing an international site which uses UTF-8 to display non-English characters. I'm also using friendly URLs which contain the item name. Obviously I can't use the non-English characters in the URL.

Is there a common practice for this conversion? I'm not sure which English characters I should be replacing them with. Some are quite obvious (like è to e), but there are other characters I am not familiar with (such as ß).

Upvotes: 4

Views: 9231

Answers (5)

Kris

Reputation: 41857

Last time I tried (about a week ago), UTF-8 characters (specifically Japanese) worked fine in URLs without any additional encoding. They even looked right in the address bar in every browser I tested with (Safari, Chrome and Firefox, all on Mac); I have no idea what browser my girlfriend was using on Windows. Aside from most Windows installations I've run across just showing squares for Japanese characters because they lack the required fonts, it seems to work fine there as well.

The URL I tried is: http://www.webghoul.de.private-void.net/cache/black-f-with-あい-50.png (WMD does not seem to like it)

Proof by screenshot http://heavymetal.theredhead.nl/~kris/stackoverflow/screenshot-utf8-url.png

So although it might not actually be allowed by the spec, from what I've seen it works well across the board, except maybe in editors that like the spec a lot ;-)

I wouldn't actually recommend using these types of characters in URLs, but I also wouldn't make it a first priority to "fix".

Upvotes: -1

Konrad Rudolph

Reputation: 545776

"Obviously I can't use the non english characters in the URL."

In fact, you can. The Wikipedia software (built in PHP) supports this, e.g. en.wikipedia.org/wiki/☃.

Note that you still need to encode the URL appropriately, as shown in the other answers.

Upvotes: 3

Gumbo

Reputation: 655449

You can use UTF-8 encoded data in URL paths. You just need to additionally percent-encode it (see rawurlencode):

// ß (U+00DF) = 0xC39F (UTF-8)
$str = "\xC3\x9F";
echo '<a href="http://en.wikipedia.org/wiki/'.rawurlencode($str).'">'.$str.'</a>';

This will echo a link to http://en.wikipedia.org/wiki/ß. Modern browsers will display the character ß itself in the location bar instead of the percent-encoded UTF-8 representation of that character (%C3%9F).
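When the path has more than one segment, each segment can be encoded separately so that the slashes between them survive. The following is only a minimal sketch of that idea; the helper name encodePath() and the example path are made up for illustration:

```php
<?php
// Hypothetical helper: percent-encode each path segment but keep the slashes.
function encodePath(string $path): string
{
    return implode('/', array_map('rawurlencode', explode('/', $path)));
}

// "\xC3\x9F" is ß in UTF-8, as in the snippet above.
echo 'http://en.wikipedia.org', encodePath("/wiki/Stra\xC3\x9Fe"), "\n";
// http://en.wikipedia.org/wiki/Stra%C3%9Fe
```

Passing the whole path to rawurlencode() in one go would also encode the slashes as %2F, which is usually not what you want for a path.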

If you don’t want to use UTF-8 but only ASCII characters, I suggest using transliteration, as Álvaro G. Vicario suggested.

Upvotes: 6

Álvaro González

Reputation: 146460

I normally use iconv() with the 'ASCII//TRANSLIT' option. This takes input like:

último año

and produces output like:

'ultimo a~no

Then I use preg_replace() to replace whitespace with dashes:

'ultimo-a~no

... and to remove unwanted characters, i.e. anything matching a pattern such as:

[^a-z0-9-]

It's probably useless with Arabic or Chinese but it works fine with Spanish, French or German.
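The steps above could be combined into a single helper along these lines. This is a minimal sketch rather than the code I actually use: the function name slugify() is hypothetical, the lowercasing and dash clean-up are extra assumptions, and the exact output of iconv() with //TRANSLIT depends on the platform's iconv implementation and the current locale, so results can differ between systems.

```php
<?php
// Hypothetical helper: turn a UTF-8 title into an ASCII-only URL slug.
function slugify(string $text): string
{
    // Transliterate to ASCII; iconv() returns false if conversion fails.
    $ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $text);
    if ($ascii === false) {
        $ascii = $text; // fall back; non-ASCII bytes are stripped below
    }
    $ascii = strtolower($ascii);
    // Replace runs of whitespace with a single dash.
    $ascii = preg_replace('/\s+/', '-', $ascii);
    // Remove everything that is not a lowercase letter, digit or dash.
    $ascii = preg_replace('/[^a-z0-9-]/', '', $ascii);
    // Collapse repeated dashes and trim them from both ends.
    return trim(preg_replace('/-+/', '-', $ascii), '-');
}

echo slugify('último año'), "\n";
```

Whatever the transliteration step produces, the final pattern guarantees that only a-z, 0-9 and dashes end up in the slug.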

Upvotes: 5

Håvard S

Reputation: 23876

Use rawurlencode to encode your name for the URL, and rawurldecode to convert the name in the URL back to the original string. These two functions convert strings to and from their percent-encoded form; current PHP documentation describes rawurlencode as following RFC 3986 (older versions followed RFC 1738).
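A quick round trip shows the idea. This is just an illustrative sketch; the item name is made up:

```php
<?php
// Hypothetical item name: "Straße 1", with ß written as its UTF-8 bytes.
$name = "Stra\xC3\x9Fe 1";

// Percent-encode the UTF-8 bytes for use in a URL path segment.
$encoded = rawurlencode($name);
echo $encoded, "\n"; // Stra%C3%9Fe%201

// Decode the path segment back into the original string.
$decoded = rawurldecode($encoded);
var_dump($decoded === $name); // bool(true)
```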

Upvotes: 2
