samach

Reputation: 3394

Working with the UTF-8 charset in PHP

I have been struggling with the UTF-8 charset for quite a while now, and I am still confused about some things.

I have a web page which allows clients to create HTML files and directories on the server. The directory name can be in any language: Adiós, días, chapter, level, etc. The directories are later used as part of the URL for the HTML files. Let’s say the user created a directory Adiós and then a file called welcome.html. To view this file, the client clicks a link, and for that link I combine the directory and file name into the path Adiós/welcome.html. I am confused about the following things.

  1. When making the directory in PHP, should I urlencode() every file and directory name?

  2. If I do urlencode the directory name, will the browser be able to open my HTML page? Instead of href="Adiós/welcome.html" it will be href="Adi%C3%B3s/welcome.html".

  3. There’s sometimes an image on my web page which I will src as "Adi%C3%B3s/ing.jpg"; is this going to work?

  4. Should the URL in the address bar show non-ASCII characters?

I actually urlencode()d everything but ran into the issues described in points 2 and 3, so I want to know what the right approach is for naming directories when working with languages other than English.

Upvotes: 0

Views: 345

Answers (3)

Your Common Sense

Reputation: 157828

I have a web page which allow clients to create html files and folders on server.

That's the wrong idea.
Store their files in the database and emulate the directory structure there as well.

EDIT: because of these silly accusations in the comments, I have to clarify:

I am talking about this particular case of HTML files with fancy names, not about binary files in general.

Satisfied?

Upvotes: 0

feeela

Reputation: 29932

  1. That depends on the underlying OS (IMHO Linux is capable of handling UTF-8 filenames, Windows is not).
  2. Normally a browser should simply request and open files like /tülüvkrü.htm; I don't know how MS IE handles such things.
  3. [Same as 2.]
  4. Sure, if the filename contains them; as stated for 2. and 3., this depends on the browser used.

Example: http://tülüvkrü.de/中华人民共和国.htm (should display "It works!")

Upvotes: 1

Artefacto

Reputation: 97805

If you save the names urlencoded in the filesystem, you must double urlencode the links and image sources if you want to access them directly, bypassing PHP. Alternatively, you could save the names without any kind of urlencoding, in which case the links would need one pass. However, this last option isn't available on Windows, where Unicode is not supported in the filesystem functions.
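To illustrate the difference between one and two encoding passes, here is a minimal sketch using PHP's rawurlencode(). The helper encode_path() and the sample names are hypothetical, chosen only for this example; it assumes PHP 7 or later and UTF-8 strings.

```php
<?php
// Hypothetical helper: percent-encode each path segment but keep the slashes.
function encode_path(string $path): string
{
    return implode('/', array_map('rawurlencode', explode('/', $path)));
}

// Case 1: the directory is stored on disk under its raw UTF-8 name.
$rawName = "Adi\u{00F3}s/welcome.html";
$hrefForRawName = encode_path($rawName); // one pass is enough

// Case 2: the directory is stored on disk under its urlencoded name,
// i.e. the literal directory name on disk is "Adi%C3%B3s".
$storedName = rawurlencode("Adi\u{00F3}s") . '/welcome.html';
$hrefForStoredName = encode_path($storedName); // second pass: "%" becomes "%25"

echo $hrefForRawName, "\n";    // Adi%C3%B3s/welcome.html
echo $hrefForStoredName, "\n"; // Adi%25C3%25B3s/welcome.html
```

The web server decodes the requested URL once before looking the file up, so the second form is what resolves to a directory literally named Adi%C3%B3s.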

Alternatively, if you still want to bypass PHP, you can use rewrite rules to reencode the names once they have been urldecoded by Apache.

Finally, you should take note that your approach is dangerous -- it is difficult to get right without security implications. You should consider having a single PHP file serve your pages, with the pages saved in a database. You could still keep pretty filenames by using the PATH_INFO variable. You could also add a caching layer in front of PHP if performance becomes an issue with this solution.
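A rough sketch of that single-file idea, under stated assumptions: the $pages array is a hypothetical in-memory stand-in for the database, find_page() is a made-up helper, and the script is assumed to be reachable as something like /index.php/Adi%C3%B3s/welcome.html so that the web server fills in PATH_INFO (normally already urldecoded).

```php
<?php
// Hypothetical stand-in for a database table mapping page paths to HTML.
$pages = [
    "Adi\u{00F3}s/welcome.html" => '<h1>Welcome</h1>',
];

// Hypothetical helper: look up the stored HTML for a PATH_INFO value.
// The path is used only as a lookup key, never as a filesystem path.
function find_page(array $pages, string $pathInfo): ?string
{
    $key = ltrim($pathInfo, '/'); // drop the leading slash
    return $pages[$key] ?? null;  // null when no such page is stored
}

// Fall back to a sample path when run outside a web server.
$pathInfo = $_SERVER['PATH_INFO'] ?? "/Adi\u{00F3}s/welcome.html";
$html = find_page($pages, $pathInfo);

if ($html === null) {
    http_response_code(404);
    echo 'Not found';
} else {
    header('Content-Type: text/html; charset=utf-8');
    echo $html;
}
```

Because the user-supplied name never touches the filesystem, questions about filename encoding on the server's OS do not arise; only the links pointing at the script need one pass of URL encoding.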

Upvotes: 1
