materialAnywhere

Reputation: 13

Regex for Domains?

I'm trying to clean up lists of websites using regex. These are sample lines from the text file I will feed through the script:

419     pcpop.com   IT  4,675
420     1234567.com.cn      Finanace    4,512
424     shanxi.gov.cn   Others  3,633
425     lss.gov.cn      Others  5,513
426     meishij.net     Local Information   5,450

The goal is to pull out only the domains:

meishij.net, shanxi.gov.cn, etc.

This is what I have so far:

re.findall(r"\w+\.com|\.cn|\.ru|\.gov|\.cc|\.life|\.net|\.org", ...

Which works fine for .com:

['it168.com']
['alibaba.com']
['.cn']
['.cn']

but for any top-level domain other than .com, it pulls only the top-level domain itself instead of the entire domain name. I thought using | as OR would cycle through the top-level domains and match each one.

Upvotes: 1

Views: 76

Answers (1)

Jan

Reputation: 43169

Just use some old-fashioned but powerful string functions:

junk = """
419     pcpop.com   IT  4,675
420     1234567.com.cn      Finanace    4,512
424     shanxi.gov.cn   Others  3,633
425     lss.gov.cn      Others  5,513
426     meishij.net     Local Information   5,450
"""

domains = [parts[1].strip()
           for line in junk.split("\n") if line
           for parts in [line.split()] if len(parts) > 1]
print(domains)

Which yields

['pcpop.com', '1234567.com.cn', 'shanxi.gov.cn', 'lss.gov.cn', 'meishij.net']

If you insist on a regex, you need a non-capturing group around your alternation, so that \w+ applies to every alternative and not just to \.com:

re.findall(r"\w+(?:\.com|\.cn|\.ru|\.gov|\.cc|\.life|\.net|\.org)", junk)
#               ^^^
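
One caveat: a group that matches only once stops at the first suffix, so shanxi.gov.cn would come back as shanxi.gov. Here is a minimal sketch of one way around that: let the group repeat. The names TLDS, sample and extract_domains are made up for illustration, the suffix list is just the one from the question, and the trailing \b is my own addition to avoid matching a prefix of a longer label.

```python
import re

# Shortened copy of the sample data from the question.
sample = """
419     pcpop.com   IT  4,675
420     1234567.com.cn      Finanace    4,512
424     shanxi.gov.cn   Others  3,633
"""

# Hypothetical suffix list, taken from the question; extend as needed.
TLDS = ["com", "cn", "ru", "gov", "cc", "life", "net", "org"]
suffix = "|".join(map(re.escape, TLDS))

# Ungrouped: | has the lowest precedence, so \w+ belongs to \.com only.
ungrouped = re.compile(r"\w+\.com|\.cn|\.gov")

# Grouped once: \w+ applies to every suffix, but matching stops after one.
grouped_once = re.compile(rf"\w+(?:\.(?:{suffix}))")

# Grouped and repeated: also consumes stacked suffixes such as .com.cn.
grouped_repeat = re.compile(rf"\w+(?:\.(?:{suffix}))+\b")


def extract_domains(text):
    """Return every domain-like match found in text."""
    return grouped_repeat.findall(text)


print(ungrouped.findall(sample))
print(grouped_once.findall(sample))
print(extract_domains(sample))
```

This still only recognises the suffixes you list, so for columnar data like yours the split() approach above remains the simpler option.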

Upvotes: 1
