Chris Hall

Reputation: 931

How To Extract All Domains From Texts?

I need to extract domains from a string. I have a valid regex that has been tested, but I cannot get it to work with the following code. I'm probably missing something obvious here.

import re

mytext = "I want to extract some domains like foo.com, bar.net or http://foobar.net/ etc"
myregex = r'^([a-zA-Z0-9]([a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}$'
foo = re.findall(myregex, mytext)
print(foo)

It just prints out an empty list, when I want something like:

['foo.com','bar.net','foobar.net']

Thank you.

Upvotes: 1

Views: 5861

Answers (3)

JaredPar

Reputation: 754715

The problem is the inclusion of ^ at the start and $ at the end of the regex. This makes it match only when the domain is the entire string. Here you want to find matches within the string. Try changing it like so:

myregex = r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'

EDIT

@Martijn pointed out that non-capturing groups needed to be used here to get the specified output.

Upvotes: 0

Martijn Pieters

Reputation: 1121834

Remove the anchors, and make the groups not capture:

r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'

The ^ and $ locked your expression to match whole strings only. re.findall() also changes behaviour when the pattern contains capturing groups; you want to list the whole match here which requires there to be no such groups. (...) is a capturing group, (?:...) is a non-capturing group.

Demo:

>>> myregex = r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
>>> re.findall(myregex, mytext)
['foo.com', 'bar.net', 'foobar.net']
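To make the effect of capturing groups on re.findall() concrete, here is a minimal sketch comparing the two forms of the pattern on the question's text. The variable names are only illustrative, and the re.finditer() variant is an alternative not mentioned above, for the case where you would rather keep the groups.

```python
import re

mytext = "I want to extract some domains like foo.com, bar.net or http://foobar.net/ etc"

# The question's pattern with only the anchors removed (groups still capture).
capturing = r'([a-zA-Z0-9]([a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
# The same pattern with both groups made non-capturing.
non_capturing = r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'

# With capturing groups, findall() returns one tuple of group contents per match.
groups_only = re.findall(capturing, mytext)

# Without capturing groups, findall() returns the full matched text.
full_matches = re.findall(non_capturing, mytext)

# Alternative: keep the groups and read the whole match from each match object.
via_finditer = [m.group(0) for m in re.finditer(capturing, mytext)]

print(groups_only)    # a list of 2-tuples, not the domains themselves
print(full_matches)   # ['foo.com', 'bar.net', 'foobar.net']
print(via_finditer)   # ['foo.com', 'bar.net', 'foobar.net']
```

Either approach gives the list asked for; the non-capturing version is simply the shorter one when the group contents are not needed.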

Upvotes: 7

Callum M

Reputation: 2080

The problem here is that your regex includes ^ at the beginning and $ at the end, meaning it only matches a domain that both starts and ends the string (i.e. just a domain).

For example, it will match "www.stackoverflow.com" but not "this is a question on www.stackoverflow.com" or "www.stackoverflow.com is great".

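It should match once you remove ^ and $ from the regex. Note that re.findall() returns the contents of the capturing groups rather than the whole match, so the groups also need to be made non-capturing to get the list shown in the question.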
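As a rough illustration of the anchoring behaviour, the sketch below runs the question's pattern, with and without ^ and $, against the three sample strings mentioned above. It uses re.search() and group(0) so the capturing groups do not affect what is reported; the variable names are made up for this example.

```python
import re

# The pattern from the question, anchored at both ends.
anchored = r'^([a-zA-Z0-9]([a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}$'
# The same pattern with the leading ^ and trailing $ stripped off.
unanchored = anchored[1:-1]

samples = [
    "www.stackoverflow.com",
    "this is a question on www.stackoverflow.com",
    "www.stackoverflow.com is great",
]

# The anchored pattern only matches when the domain is the whole string.
anchored_hits = [bool(re.search(anchored, s)) for s in samples]

# The unanchored pattern finds the domain anywhere in the string.
unanchored_hits = [re.search(unanchored, s).group(0) for s in samples]

print(anchored_hits)    # [True, False, False]
print(unanchored_hits)  # ['www.stackoverflow.com'] repeated three times
```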

Upvotes: 0
