freento

Reputation: 2949

Convert URL to correct filename (Linux/Windows)

I have a script that works with different sites. For each site it produces one CSV file with a unique filename based on the site URL. The URLs can vary, for example:

http://test1.com
http://test2.com/testurl
http://test3.com/test/path/

I want to convert URLs to filenames by removing (or replacing with '_') every character that can cause a conflict on Linux/Windows, for example:

http://test1.com will be test1com.csv
http://test2.com/testurl will be test2comtesturl.csv
http://test3.com/test/path/ will be test3comtestpath.csv

I could use parse_url and concatenate the host and path, replacing '/' and '.' with '_', but I'm not sure that is the best solution, because URLs vary and may contain other characters that cannot be used in a filename.

Upvotes: 1

Views: 624

Answers (2)

Shengding Hu

Reputation: 1

For anyone who uses Python, I wrote a simple class that converts a URL to a filename and back:

import string


class URLProcessor:
    """Reversibly map a URL to a filename that is safe on Linux and Windows."""

    ESCAPE = "▁"  # U+2581, marks the start of a two-character escape code

    def __init__(self):
        # Characters to escape. The escape marker itself comes first so that a
        # URL containing it still round-trips; '|' is also invalid on Windows.
        prohibited_fp_character = self.ESCAPE + "#%&{}\\<>*?/ $!'\":@|"
        # Each prohibited character becomes ESCAPE plus one lowercase letter.
        self.prohibited_fp_character_to_id = {
            c: self.ESCAPE + string.ascii_lowercase[i]
            for i, c in enumerate(prohibited_fp_character)
        }
        self.id_to_prohibited_fp_character = {
            v: k for k, v in self.prohibited_fp_character_to_id.items()
        }

    def url_to_filename(self, url):
        # Replace each prohibited character with its escape code.
        return "".join(self.prohibited_fp_character_to_id.get(c, c) for c in url)

    def filename_to_url(self, filename):
        # Undo url_to_filename by decoding each ESCAPE + letter pair.
        chars = []
        escaped = False
        for c in filename:
            if escaped:
                chars.append(self.id_to_prohibited_fp_character[self.ESCAPE + c])
                escaped = False
            elif c == self.ESCAPE:
                escaped = True
            else:
                chars.append(c)
        return "".join(chars)

Upvotes: 0

Shomz

Reputation: 37701

You can make a list of filename-safe characters and convert any character that's not in the list into _.

Just be careful of duplicates, if any: site.com/test/x and site.com/test.x, for example, would both become site_com_test_x. Find a way to handle them.
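
A minimal Python sketch of that idea, in case it helps. Everything here is an assumption rather than something from the question: the function name url_to_filename, the allowlist (ASCII letters, digits, '-' and '_'), the 100-character cap, and the choice of a short SHA-1 suffix as one possible way to keep colliding URLs apart.

```python
import hashlib
import re
from urllib.parse import urlsplit

# Assumed allowlist: ASCII letters, digits, '-' and '_'.
# Any run of other characters is collapsed into a single '_'.
_UNSAFE = re.compile(r"[^A-Za-z0-9_-]+")


def url_to_filename(url, ext=".csv", max_len=100):
    """Hypothetical helper: build a filename-safe name from a URL."""
    parts = urlsplit(url)
    # Drop the scheme; keep host, path and (if present) the query string.
    raw = parts.netloc + parts.path
    if parts.query:
        raw += "?" + parts.query
    slug = _UNSAFE.sub("_", raw).strip("_")[:max_len]
    # A short hash of the full URL keeps different URLs apart even when
    # their slugs are identical (site.com/test/x vs site.com/test.x).
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return f"{slug}_{digest}{ext}"


if __name__ == "__main__":
    for u in ("http://site.com/test/x", "http://site.com/test.x"):
        print(u, "->", url_to_filename(u))
```

The readable part of the name is lossy, so this does not let you recover the URL from the filename. If you need that, keep a separate URL-to-filename lookup or use a reversible escaping scheme instead.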

Upvotes: 0
