Reputation: 75
I have a large text file of URLs (>1 million URLs). The URLs represent product pages across several different domains.
I'm trying to parse out the SKU and product name from each URL, such as:
I already have the individual regex patterns figured out for parsing the two components (product name and SKU) out of the URLs for every domain in my list; that comes to nearly 100 different patterns.
While I've figured out how to test one URL/pattern at a time, I'm having trouble working out how to structure a script that will read in my entire list and then parse each line with the relevant regex pattern. Any suggestions on how best to tackle this?
If my input is one column (URL), my desired output is 4 columns (URL, domain, product_name, SKU).
Upvotes: 0
Views: 285
Reputation: 1199
Since it's fairly easy to extract the domain name from a URL, you can map each domain name to the regex pattern for that domain.
Like this:
domain_patterns = {
    'domain1.com': 'regex_pattern_for_domain1',
    'domain2.com': 'regex_pattern_for_domain2'
}
Then read your file line by line, apply a general regex to extract the domain name, and use that domain to look up the specific pattern.
def extract_data(url, regex_pattern):
    # code to extract product name and SKU using the domain-specific regex
    return ['product_name', 'sku']
def extract_domain(url):
    # apply a general regex pattern to extract the domain name
    return 'domain name'
parsed_data = []
with open('urls.txt') as f:
    for line in f:
        url = line.strip()
        domain = extract_domain(url)            # extract the domain from the URL
        domain_regex = domain_patterns[domain]  # use the dictionary to get the regex for that domain
        data = extract_data(url, domain_regex)  # extract product name and SKU with that regex
        data.append(domain)
        data.append(url)
        parsed_data.append(data)  # append to the list, or write each row to another file if it is too big to fit into memory
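Since the desired output is four columns (URL, domain, product_name, SKU), the collected rows can then be written out as CSV. A minimal sketch, assuming the parsed_data rows built above (each row is [product_name, sku, domain, url]) and a placeholder output file name parsed_urls.csv:
import csv

with open('parsed_urls.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['url', 'domain', 'product_name', 'sku'])  # header row
    for product_name, sku, domain, url in parsed_data:
        writer.writerow([url, domain, product_name, sku])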
Upvotes: 2
Reputation: 11681
While it is possible to roll this all into one massive regex, that might not be the easiest approach. Instead, I would use a two-pass strategy. Build a dict mapping each domain name to the regex pattern that works for that domain. In the first pass, detect the domain for the line using a single regex that works for all URLs. Then use the discovered domain to look up the appropriate regex in your dict and extract the fields for that domain.
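A minimal sketch of that two-pass strategy; the general domain regex, the per-domain patterns, and the file name urls.txt are placeholders, and each per-domain pattern is assumed to capture the product name and SKU as named groups:
import re

# first pass: one general regex that pulls the domain out of any URL
DOMAIN_RE = re.compile(r'https?://(?:www\.)?([^/]+)')

# second pass: domain -> pattern whose groups are product name and SKU (placeholder patterns)
PATTERNS = {
    'domain1.com': re.compile(r'/product/(?P<name>[^/]+)/(?P<sku>\d+)'),
    'domain2.com': re.compile(r'/p/(?P<sku>\w+)-(?P<name>[^/]+)'),
}

with open('urls.txt') as f:
    for line in f:
        url = line.strip()
        domain_match = DOMAIN_RE.match(url)
        if not domain_match:
            continue
        pattern = PATTERNS.get(domain_match.group(1))
        if not pattern:
            continue  # skip domains we have no pattern for
        m = pattern.search(url)
        if m:
            print(url, domain_match.group(1), m.group('name'), m.group('sku'), sep='\t')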
Upvotes: 1