Reputation: 577
I need to generate file names from user-entered names. These names could be in any language. For example:
These are user-entered values, so I have no guarantee that the names are free of characters that are invalid in file names.
Users will be downloading these files from their browser, so I need to ensure the file names are valid on all operating systems in all configurations.
I am currently doing this for English-speaking countries by removing all non-alphanumeric characters with a simple regex:
string = string.replaceAll("\\s+", "_"); // convert whitespace first, otherwise the next line strips it
string = string.replaceAll("[^a-zA-Z0-9_]", "");
Some example conversions:
Obviously this does not work internationally.
I've considered finding/generating a blacklist of all characters that are invalid on all file systems and stripping those from the names. I've been unable to find a comprehensive list.
I'd prefer to use existing code in a common library if possible. I imagine this is an already solved problem, however I can't find a solution that works internationally.
The filename is for the user downloading the file, not for me. I'm not going to be storing these files. These files are dynamically generated by the server upon request from data in a database. The filenames are for the convenience of the person downloading the file.
Upvotes: 6
Views: 5669
Reputation: 131445
Summarizing and paraphrasing @eee's answer...
String sanitizeFilename(String unsanitized) {
    return unsanitized
        .replaceAll("[\\?\\\\/:|<>\\*]", " ") // filter out ? \ / : | < > *
        .replaceAll("\\s", "_");              // white space as underscores
}
(Note that this does not collapse runs of whitespace into a single underscore!)
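If you do want runs of whitespace collapsed, a slightly stricter variant might look like the sketch below. It is only an illustration, not part of the original answer: the class name `DownloadNameSanitizer`, the trimming of leading and trailing whitespace, and the `"download"` fallback for names that end up empty are all my own assumptions. It also replaces the double quote and ASCII control characters, which Windows rejects in file names as well. It does not handle reserved Windows device names such as CON or NUL.

```java
import java.util.regex.Pattern;

final class DownloadNameSanitizer {
    // Characters Windows reserves in file names (< > : " / \ | ? *)
    // plus ASCII control characters.
    private static final Pattern RESERVED =
            Pattern.compile("[<>:\"/\\\\|?*\\p{Cntrl}]");
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    static String sanitize(String unsanitized) {
        // Turn every reserved character into a space first...
        String spaced = RESERVED.matcher(unsanitized).replaceAll(" ");
        // ...then trim and collapse each run of whitespace into one underscore.
        String collapsed = WHITESPACE_RUN.matcher(spaced.trim()).replaceAll("_");
        // Assumption: fall back to a fixed name if nothing usable is left.
        return collapsed.isEmpty() ? "download" : collapsed;
    }

    public static void main(String[] args) {
        System.out.println(sanitize("John   Smith"));
        System.out.println(sanitize("高?岡和\\子*"));
    }
}
```

Because of the trimming, `高?岡和\子*` comes out as `高_岡和_子` with no trailing underscore. Drop the `trim()` call if you would rather keep it.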
Upvotes: 0
Reputation: 718718
My advice would be to make it a requirement that your application runs on a platform that supports Unicode filenames. Most do these days.
I don't think it is feasible to map from Unicode to an (unspecified) restricted character set, while still retaining human readability AND the original meaning AND avoiding collisions. Indeed, it is not even possible to do this mapping from Latin-1 to ASCII.
If your application has to run on platforms that don't support Unicode filenames, then you will need to sacrifice human readability and/or meaning in the filenames in some cases. Besides, consider whether (for example) ASCII-ized Chinese characters or Cyrillic letters or letters with accents stripped off are going to be acceptable to your end users.
What I'd do is offer the user two options to select from:
An option that uses Unicode filenames for uploaded files. This should be the default, since most users' machines will support this.
A fallback option that uses generated names which are not related to the original strings / text.
In reality, if the user's machine doesn't support Unicode, they are going to have huge problems dealing with textual names that are not encoded using the machine's native encoding. There's no completely reliable way to find out what that is. Even if you have a semi-reliable way of figuring that out ... on the server side ... the problem of mapping all of Unicode to that encoding is intractable.
It is better to encourage the user to upgrade his / her operating system to a Unicode capable one.
Upvotes: 0
Reputation: 4000
The regex [^a-zA-Z0-9] filters out every non-ASCII character, so it drops all Unicode characters at code point 128 and above.
Assuming that you want to filter user input for valid file names by replacing invalid file-name characters such as ? \ / : | < > * with an underscore (_):
public class ReplaceI18N {
    public static void main(String[] args) {
        String[] names = {
            "John Smith",
            "高岡和子",
            "محمد سعيد بن عبد العزيز الفلسطيني",
            "|J:o<h>n?Sm\\it/h*",
            "高?岡和\\子*",
            "محمد /سعيد بن عبد ?العزيز :الفلسطيني\\"
        };
        for (String s : names) {
            // Java strings are already Unicode, so no byte-level re-encoding is needed.
            // Replace each invalid character with a space first...
            String u = s.replaceAll("[\\?\\\\/:|<>\\*]", " "); // filter ? \ / : | < > *
            // ...then turn every run of whitespace into a single underscore.
            u = u.replaceAll("\\s+", "_");
            System.out.println(s + " = " + u);
        }
    }
}
The output:
John Smith = John_Smith
高岡和子 = 高岡和子
محمد سعيد بن عبد العزيز الفلسطيني = محمد_سعيد_بن_عبد_العزيز_الفلسطيني
|J:o<h>n?Sm\it/h* = _J_o_h_n_Sm_it_h_
高?岡和\子* = 高_岡和_子_
محمد /سعيد بن عبد ?العزيز :الفلسطيني\ = محمد_سعيد_بن_عبد_العزيز_الفلسطيني_
The valid filenames even with Unicode characters will be displayable on any webpage that supports UTF-8 encoding with the correct Unicode font.
In addition, each will be the correct name for its file on any OS file-system that supports Unicode (tested OK on Windows XP, Windows 7).
But if you want to pass each valid filename as a URL string, make sure to encode it properly using URLEncoder and later decode each encoded URL using URLDecoder.
Upvotes: 4
Reputation: 54242
Windows appears to support Unicode file names, I know Linux does, and apparently OS X does too. Presumably a well-written browser would fix invalid characters in a file name before saving it.
It seems like you should be able to just use Unicode file names. Is there some OS or browser that this doesn't work on?
Upvotes: 0
Reputation: 798526
Encode the filename as UTF-8, and then URL-encode the result.
'高岡和子' -> '%E9%AB%98%E5%B2%A1%E5%92%8C%E5%AD%90'
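In Java that could be sketched roughly as follows, using the standard `URLEncoder` and `URLDecoder` classes. The class name `FilenameUrlCodec` and its helper methods are made up for illustration. One thing to watch: `URLEncoder` does form encoding, so a space becomes `+`. The sketch rewrites that to `%20` on the assumption that the result goes into a URL path or a header rather than a form body.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class FilenameUrlCodec {
    // Percent-encode the UTF-8 bytes of the name.
    static String encode(String name) {
        try {
            // URLEncoder emits '+' for spaces (and %2B for a literal '+'),
            // so replacing '+' with %20 afterwards is unambiguous.
            return URLEncoder.encode(name, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Reverse of encode(): turn the percent-escapes back into text.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String encoded = encode("高岡和子");
        System.out.println(encoded);
        System.out.println(decode(encoded));
    }
}
```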
Upvotes: 0
Reputation: 56457
Letting the input determine a file name without proper sanitizing seems prone to security attacks. You can use a hash function (SHA-1, MD5) to generate a valid filename. Just be aware that you can't derive the original name from the hash.
Also, if you can have a simple lookup table, you can assign special identifiers to the names (like sequential numbers or GUIDs), and use the identifier as the filename.
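A minimal sketch of both ideas in Java, assuming you are fine with names that are opaque to the user. The class name `OpaqueFilenames` and its methods are hypothetical. The hash is computed over the UTF-8 bytes of the name, which is my choice, not something the approach requires.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

final class OpaqueFilenames {
    // Hex-encoded SHA-1 of the name: 40 characters, all in [0-9a-f],
    // so it is a valid file name everywhere. It cannot be reversed.
    static String sha1Hex(String name) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            byte[] hash = digest.digest(name.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 must be provided by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    // Random identifier to store in a lookup table next to the real name.
    static String randomId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        System.out.println(sha1Hex("高岡和子"));
        System.out.println(randomId());
    }
}
```

The same input always hashes to the same file name, so two users with the same name get the same file name. A random identifier avoids that, but you then have to store the mapping.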
Another thing, have you thought about homonyms?
Upvotes: 0