Reputation: 75
I have a large text file of URLs (>1 million URLs). The URLs represent product pages across several different domains.
I'm trying to parse out the SKU and product name from each URL, such as:
I already have the individual regex patterns figured out for parsing the two components (product name and SKU) out of the URLs for every domain in my list; that comes to nearly 100 different patterns.
While I've figured out how to test one URL/pattern at a time, I'm having trouble working out how to structure a script that will read in my entire list and then parse each line with the relevant regex pattern. Any suggestions on how best to tackle this?
If my input is one column (URL), my desired output is 4 columns (URL, domain, product_name, SKU).
Upvotes: 0
Views: 285
Reputation: 1199
Since it's fairly easy to extract the domain name from a URL, you can map each domain name to the regex pattern for that domain.
Like this:
domain_patterns = {
    'domain1.com': 'regex_pattern_for_domain1',
    'domain2.com': 'regex_pattern_for_domain2'
}
Then read your file line by line, apply a general regex to extract the domain name, and use that domain to look up the specific pattern.
def extract_data(url, regex_pattern):
    # code to extract product name and SKU using the domain-specific regex
    return ['product_name', 'sku']
def extract_domain(url):
    # apply a general regex pattern to extract the domain name
    return 'domain name'
parsed_data = []
with open('urls.txt') as f:
    for line in f:
        url = line.strip()
        domain = extract_domain(url)            # extract the domain from the URL
        domain_regex = domain_patterns[domain]  # use the dictionary to get the regex for that domain
        data = extract_data(url, domain_regex)  # extract product name and SKU with that regex
        data.append(domain)
        data.append(url)
        parsed_data.append(data)  # append to the list, or write each row to another file if it is too big to fit into memory
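Since the desired output is four columns (URL, domain, product_name, SKU), the collected rows can then be written out as CSV. A minimal sketch, assuming the parsed_data rows built above (each row is [product_name, sku, domain, url]) and a placeholder output file name parsed_urls.csv:
import csv

with open('parsed_urls.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['url', 'domain', 'product_name', 'sku'])  # header row
    for product_name, sku, domain, url in parsed_data:
        writer.writerow([url, domain, product_name, sku])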
Upvotes: 2
Reputation: 11681
While it is possible to roll this all into one massive regex, that might not be the easiest approach. Instead, I would use a two-pass strategy. Build a dict mapping each domain name to the regex pattern that works for that domain. In the first pass, detect the domain for the line using a single regex that works for all URLs. Then use the discovered domain to look up the appropriate regex in your dict and extract the fields for that domain.
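A minimal sketch of that two-pass strategy; the general domain regex, the per-domain patterns, and the file name urls.txt are placeholders, and each per-domain pattern is assumed to capture the product name and SKU as named groups:
import re

# first pass: one general regex that pulls the domain out of any URL
DOMAIN_RE = re.compile(r'https?://(?:www\.)?([^/]+)')

# second pass: domain -> pattern whose groups are product name and SKU (placeholder patterns)
PATTERNS = {
    'domain1.com': re.compile(r'/product/(?P<name>[^/]+)/(?P<sku>\d+)'),
    'domain2.com': re.compile(r'/p/(?P<sku>\w+)-(?P<name>[^/]+)'),
}

with open('urls.txt') as f:
    for line in f:
        url = line.strip()
        domain_match = DOMAIN_RE.match(url)
        if not domain_match:
            continue
        pattern = PATTERNS.get(domain_match.group(1))
        if not pattern:
            continue  # skip domains we have no pattern for
        m = pattern.search(url)
        if m:
            print(url, domain_match.group(1), m.group('name'), m.group('sku'), sep='\t')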
Upvotes: 1