Reputation: 13
I'm trying to clean up lists of websites using regex. These are sample lines from the text file I will feed through the script:
419 pcpop.com IT 4,675
420 1234567.com.cn Finanace 4,512
424 shanxi.gov.cn Others 3,633
425 lss.gov.cn Others 5,513
426 meishij.net Local Information 5,450
The goal is to pull out only the domains:
meishij.net, shanxi.gov.cn, etc.
This is what I have so far:
re.findall(r"\w+\.com|\.cn|\.ru|\.gov|\.cc|\.life|\.net|\.org", ...
Which works fine for .com:
['it168.com']
['alibaba.com']
['.cn']
['.cn']
but for any top-level domain other than .com it pulls only the top-level domain itself instead of the entire domain name. I thought using | as OR would cycle through the top-level domains to match.
Upvotes: 1
Views: 76
Reputation: 43169
Just use some old-fashioned but powerful string functions:
junk = """
419 pcpop.com IT 4,675
420 1234567.com.cn Finanace 4,512
424 shanxi.gov.cn Others 3,633
425 lss.gov.cn Others 5,513
426 meishij.net Local Information 5,450
"""
domains = [parts[1].strip()
           for line in junk.split("\n") if line
           for parts in [line.split()] if len(parts) > 1]
print(domains)
This yields:
['pcpop.com', '1234567.com.cn', 'shanxi.gov.cn', 'lss.gov.cn', 'meishij.net']
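Since the real input is a text file, the same idea can be wrapped in a small helper that accepts any iterable of lines. This is only a sketch: the name extract_domains is made up for illustration, and it assumes the domain is always the second whitespace-separated field, as in the sample.

```python
def extract_domains(lines):
    """Return the second whitespace-separated field of each usable line."""
    domains = []
    for line in lines:
        parts = line.split()
        # Skip blank lines and lines with fewer than two fields.
        if len(parts) > 1:
            domains.append(parts[1])
    return domains


sample = [
    "419 pcpop.com IT 4,675",
    "",
    "426 meishij.net Local Information 5,450",
]
print(extract_domains(sample))
```

An open file object is also an iterable of lines, so the helper could be called on one directly inside a with open(...) block.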
If you insist, you'd need to form a non-capturing group around your alternations:
re.findall(r"\w+(?:\.com|\.cn|\.ru|\.gov|\.cc|\.life|\.net|\.org)", junk)
#               ^^^
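To see why the group matters, here is a small comparison on one of the sample lines. Without the group, | splits the whole pattern, so \w+ belongs only to the .com branch. With the group, the match still stops at the first listed suffix, so the last pattern (my own variation, not part of the original attempt) repeats the group to pick up multi-part endings such as .com.cn. The variable names are illustrative only.

```python
import re

line = "420 1234567.com.cn Finanace 4,512"

# "|" has the lowest precedence: this reads as (\w+\.com) OR (\.cn).
ungrouped = r"\w+\.com|\.cn"

# The group limits the alternation to the suffix, so \w+ applies to every branch.
grouped = r"\w+(?:\.com|\.cn|\.ru|\.gov|\.cc|\.life|\.net|\.org)"

# Repeating the suffix group lets one match span several suffixes in a row.
repeated = r"\w+(?:\.(?:com|cn|ru|gov|cc|life|net|org))+"

r_ungrouped = re.findall(ungrouped, line)
r_grouped = re.findall(grouped, line)
r_repeated = re.findall(repeated, line)

print(r_ungrouped)  # ['1234567.com', '.cn']
print(r_grouped)    # ['1234567.com']
print(r_repeated)   # ['1234567.com.cn']
```

Note that \w does not match hyphens, so hyphenated domain names would need a wider character class.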
Upvotes: 1