Reputation: 22252
Is there a way to turn arbitrary user input names into safe filenames with an encoding that is reversible?
I have some data files that belong to entities that users named. Of course, they can do silly things like put invalid filesystem characters in their names.
The two suggestions I see frequently for this are:
A) Base64 encode them
B) Strip illegal characters
Base64 is reversible, but for debugging and introspection it's really nice when the file names look as much like the original names as possible. Approach B isn't reversible, so the "actual" name has to be stored separately anyway, at which point there's no real advantage over just using a UUID or something similar.
This is specifically for Linux. While this isn't Python-specific, that's what I'm implementing it in.
Upvotes: 2
Views: 1614
Reputation: 148
You could URL-encode the string provided by the user.
According to the Wikipedia article on percent-encoding (which itself cites RFC 3986), the only characters that never need escaping in a URL (the "unreserved" set) are A-Z, a-z, 0-9, hyphen, underscore, dot, and tilde (~). Tilde has a special meaning in the shell, but it is legal in Linux filenames.
It looks like URL-encoding is pretty easy in Python with urllib (urllib.parse.quote in Python 3), but I'm not a Python programmer.
See: URL encoding/decoding with Python
Upvotes: 1
Reputation: 1121594
You could use URL encoding:
from urllib.parse import quote
safefilename = quote(filename, safe='')
This is fully round-trippable, and keeps ASCII letters, digits, and the characters _ . - ~ readable:
>>> from urllib.parse import quote, unquote
>>> quote('foo/../bar', safe='')
'foo%2F..%2Fbar'
>>> unquote(quote('foo/../bar', safe=''))
'foo/../bar'
Do set safe to the empty string; the default is '/', so slashes are not normally escaped.
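One edge case the calls above don't cover: quote() leaves dots alone, so a user-supplied name of '.' or '..' comes back unchanged, and neither is usable as a regular filename on Linux. Below is a minimal sketch of one way to handle that, assuming it's acceptable to percent-encode the dots in just those two cases and to reject empty names. The helper names encode_name and decode_name are made up for illustration; they are not part of urllib.

```python
from urllib.parse import quote, unquote


def encode_name(name: str) -> str:
    """Turn an arbitrary user-supplied name into a single path component.

    Hypothetical helper: percent-encodes everything except ASCII letters,
    digits and the characters quote() always leaves alone.
    """
    if not name:
        # An empty string cannot be a filename; rejecting it is an assumption,
        # pick whatever policy suits your application.
        raise ValueError("name must not be empty")
    encoded = quote(name, safe='')
    if encoded in ('.', '..'):
        # quote() never escapes '.', so these two special directory entries
        # would pass through unchanged. Escape the dots by hand.
        encoded = encoded.replace('.', '%2E')
    return encoded


def decode_name(filename: str) -> str:
    """Reverse encode_name(); unquote() also decodes the hand-written %2E."""
    return unquote(filename)
```

The mapping stays reversible because a literal '%2E' in the input is itself encoded as '%252E', so it can't collide with the escaped dots. Keep in mind that percent-encoding makes names longer (each non-ASCII byte becomes three characters), so very long names may still run into the filesystem's per-component length limit, which is typically 255 bytes on Linux.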
Upvotes: 4