Reputation: 6655
I'm developing an international site which uses UTF8 to display non english characters. I'm also using friendly URLS which contain the item name. Obviously I can't use the non english characters in the URL.
Is there some sort of common practice for this conversion? I'm not sure which english characters i should be replacing them with. Some are quite obvious (like è to e) but other characters I am not familiar with (such as ß).
Upvotes: 4
Views: 9231
Reputation: 41857
Last time I tried (about a week ago), UTF-8 (specifically japanese) characters worked fine in URLs without any additional encoding. Even looked right in address bars across all browsers I tested with (Safari, Chrome and Firefox, all on Mac) and I have no idea what browser my girlfriend was using on windows. Aside from most windows installations i've run across just showing squares for japanese characters because they lack the required fonts to display them, it seems to work fine there as well.
The URL I tried is: http://www.webghoul.de.private-void.net/cache/black-f-with-あい-50.png (WMD does not seem to like it)
Proof by screenshot http://heavymetal.theredhead.nl/~kris/stackoverflow/screenshot-utf8-url.png
So it might not actually be allowed by the spec, for what i've seen it works well across the board, except maybe in editors that like the spec a lot ;-)
I wouldn't actually recommend using these types of characters in URLs, but I also wouldn't make it a first priority to "fix".
Upvotes: -1
Reputation: 545776
Obviously I can't use the non english characters in the URL.
In fact, you can. The Wikipedia software (built in PHP) supports this, e.g. en.wikipedia.org/wiki/☃.
Notice that you need to encode the URL appropriately, as shown in the other answers.
Upvotes: 3
Reputation: 655449
You can use UTF-8 encoded data in URL paths. You just need to encoded it additionally with the Percent encoding (see rawurlencode
):
// ß (U+00DF) = 0xC39F (UTF-8)
$str = "\xC3\x9F";
echo '<a href="http://en.wikipedia.org/wiki/'.rawurlencode($str).'">'.$str.'</a>';
This will echo a link to http://en.wikipedia.org/wiki/ß. Modern browsers will display the character ß
itself in the location bar instead of the percentage encoded representation of that character in UTF-8 (%C3%9F
).
If you don’t want to use UTF-8 but only ASCII characters, I suggest to use transliteration like Álvaro G. Vicario suggested.
Upvotes: 6
Reputation: 146460
I normally use iconv() with the 'ASCII//TRANSLIT' option. This takes input like:
último año
and produces output like:
'ultimo a~no
Then I use preg_replace() to replace white spaces with dashes:
'ultimo-a~no
... and remove unwanted chars, e.g.
[^a-z0-9-]
It's probably useless with Arabic or Chinese but it works fine with Spanish, French or German.
Upvotes: 5
Reputation: 23876
Use rawurlencode
to encode your name for the URL, and rawurldecode
to convert the name in the URL back to the original string. These two functions convert strings to and from URLs in compliance with RFC 1738.
Upvotes: 2