Reputation: 2949
I have a script that processes different sites. For each site it produces one CSV file with a unique filename based on the site URL. The URLs vary, for example:
http://test1.com
http://test2.com/testurl
http://test3.com/test/path/
I want to convert the URLs to filenames by removing every character that can cause conflicts on Linux/Windows, or replacing it with '_'. For example:
http://test1.com will be test1com.csv
http://test2.com/testurl will be test2comtesturl.csv
http://test3.com/test/path/ will be test3comtestpath.csv
I could use parse_url and concatenate the host and path, replacing '/' and '.' with '_', but I'm not sure this is the best solution, because URLs vary and may contain other characters that cannot be used in a filename.
Upvotes: 1
Views: 624
Reputation: 1
For anyone who uses Python, I wrote a simple class that converts a URL to a filename and back:
    import string


    class URLProcessor:
        def __init__(self):
            # Characters that are unsafe in file paths on Linux/Windows.
            prohibited_fp_character = r"""#%&{}\<>*?/ $!'":@"""
            prohibited_fp_character_to_id = {}
            atoz = string.ascii_lowercase
            # Each prohibited character maps to "▁" followed by a letter.
            for i, c in enumerate(prohibited_fp_character):
                prohibited_fp_character_to_id[c] = "▁" + atoz[i]
            self.prohibited_fp_character_to_id = prohibited_fp_character_to_id
            # Reverse lookup, used to restore the original URL.
            self.id_to_prohibited_fp_character = {
                v: k for k, v in self.prohibited_fp_character_to_id.items()
            }

        def url_to_filename(self, url):
            fp = ""
            for c in url:
                if c in self.prohibited_fp_character_to_id:
                    fp += self.prohibited_fp_character_to_id[c]
                else:
                    fp += c
            return fp

        def filename_to_url(self, filename):
            url = ""
            repl = 0  # 1 when the previous character was the "▁" marker
            for c in filename:
                if c == "▁":
                    repl = 1
                elif repl == 1:
                    url += self.id_to_prohibited_fp_character["▁" + c]
                    repl = 0
                else:
                    url += c
            return url
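If a hand-built character table is not required, a similar reversible mapping can be sketched with percent-encoding from the standard library. This is a different technique from the class above, not a rewrite of it; the function names and the ".csv" default are only illustrative assumptions.

```python
from urllib.parse import quote, unquote


def encode_url_as_filename(url: str, ext: str = ".csv") -> str:
    """Percent-encode every character except letters, digits and "_.-~"."""
    # safe="" makes quote() also encode "/", which it would otherwise keep.
    return quote(url, safe="") + ext


def decode_filename_to_url(filename: str, ext: str = ".csv") -> str:
    """Reverse encode_url_as_filename()."""
    if filename.endswith(ext):
        filename = filename[: -len(ext)]
    return unquote(filename)


name = encode_url_as_filename("http://test2.com/testurl")
print(name)                          # http%3A%2F%2Ftest2.com%2Ftesturl.csv
print(decode_filename_to_url(name))  # http://test2.com/testurl
```

The "%" character is allowed in filenames on both Linux and Windows, but it can be awkward in shell or batch scripts, so this may or may not suit your setup.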
Upvotes: 0
Reputation: 37701
You can make a list of URL-safe characters and convert any character that's not in the list into _.
Just be careful of duplicates, if any: site.com/test/x and site.com/test.x, for example, would both become site.com_test_x. Find a way to handle them.
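A minimal Python sketch of that idea, under some assumptions: the whitelist here is just ASCII letters and digits, the scheme is dropped, and duplicates are handled by appending a short hash of the full URL. The function name, the whitelist, and the 8-character digest length are illustrative choices, not anything from the question.

```python
import hashlib
import re
from urllib.parse import urlsplit

# Assumed whitelist: ASCII letters and digits; any other run becomes one "_".
_UNSAFE = re.compile(r"[^A-Za-z0-9]+")


def url_to_filename(url: str, ext: str = ".csv") -> str:
    """Build a filesystem-safe filename from a URL, resistant to collisions."""
    parts = urlsplit(url)
    # Drop the scheme; keep host, path and query so distinct pages stay distinct.
    raw = parts.netloc + parts.path
    if parts.query:
        raw += "?" + parts.query
    safe = _UNSAFE.sub("_", raw).strip("_")
    # A short digest of the full URL separates URLs whose safe names are equal.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return f"{safe}_{digest}{ext}"


a = url_to_filename("http://site.com/test/x")
b = url_to_filename("http://site.com/test.x")
print(a != b)  # True: both start with "site_com_test_x_" but the digests differ
```

Unlike a reversible encoding, this mapping is one-way, so keep the original URL somewhere (inside the CSV, for instance) if you need it back.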
Upvotes: 0