Reputation: 931
I need to extract domains from a string. I have a valid regex that has been tested, but I cannot get it to work with the following code. I'm probably missing something obvious here.
import re
mytext = "I want to extract some domains like foo.com, bar.net or http://foobar.net/ etc"
myregex = r'^([a-zA-Z0-9]([a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}$'
foo = re.findall(myregex, mytext)
print(foo)
It just prints out an empty list, when I want something like:
['foo.com','bar.net','foobar.net']
Thank you.
Upvotes: 1
Views: 5861
Reputation: 754715
The problem is the inclusion of ^ at the start and $ at the end of the regex. This makes it match only when the domain is the entire string. Here you want to find matches within the string. Try changing it like so:
myregex = r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
EDIT
@Martijn pointed out that non-capturing groups needed to be used here to get the specified output.
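To illustrate why the groups matter, the sketch below runs re.findall() on the question's text with both forms of the pattern once the anchors are gone. The variable names capturing, non_capturing, with_groups and without_groups are illustrative and not from the original posts.

```python
import re

mytext = "I want to extract some domains like foo.com, bar.net or http://foobar.net/ etc"

# Anchors removed, but the original capturing groups kept (illustrative name).
capturing = r'([a-zA-Z0-9]([a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
# Same pattern with both groups made non-capturing.
non_capturing = r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'

# With capturing groups, findall() returns one tuple of group contents per match.
with_groups = re.findall(capturing, mytext)
# Without capturing groups, findall() returns the whole matched text.
without_groups = re.findall(non_capturing, mytext)

print(with_groups)
print(without_groups)
```

The first list contains two-element tuples (one element per group) rather than full domains, which is why removing the anchors alone does not give the requested output.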
Upvotes: 0
Reputation: 1121834
Remove the anchors, and make the groups not capture:
r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
The ^ and $ locked your expression to match whole strings only. re.findall() also changes behaviour when the pattern contains capturing groups; you want to list the whole match here, which requires there to be no such groups. (...) is a capturing group, (?:...) is a non-capturing group.
Demo:
>>> myregex = r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
>>> re.findall(myregex, mytext)
['foo.com', 'bar.net', 'foobar.net']
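If the capturing groups have to stay in the pattern for some other reason, one possible alternative (an assumption on my part, not something the question asked for) is re.finditer(), which yields match objects whose group(0) is always the whole match. The names capturing and domains below are illustrative.

```python
import re

mytext = "I want to extract some domains like foo.com, bar.net or http://foobar.net/ etc"

# The question's pattern without anchors, capturing groups left in place.
capturing = r'([a-zA-Z0-9]([a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'

# group(0) is the full matched text, regardless of how many groups the pattern has.
domains = [m.group(0) for m in re.finditer(capturing, mytext)]
print(domains)
```

This trades the one-line findall() call for a comprehension, but avoids having to rewrite every group as (?:...).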
Upvotes: 7
Reputation: 2080
The problem here is that your regex includes ^ at the beginning and $ at the end, meaning it only matches a domain that both starts and ends the string (i.e. the string is just a domain).
For example, it will match "www.stackoverflow.com" but not "this is a question on www.stackoverflow.com" or "www.stackoverflow.com is great".
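Removing ^ and $ from the regex lets it match domains anywhere in the string. Note that re.findall() returns the contents of the groups rather than the whole match when the pattern contains capturing groups, so the groups should also be made non-capturing with (?:...).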
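As a minimal sketch of the anchoring behaviour, the code below tries the three strings mentioned in this answer against an anchored and an unanchored version of the pattern. It uses the non-capturing form of the pattern; the names core, anchored, samples and results are illustrative.

```python
import re

# Non-capturing form of the domain pattern, without anchors.
core = r'(?:[a-zA-Z0-9](?:[a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}'
# Anchored form: the whole string has to be a domain.
anchored = '^' + core + '$'

samples = [
    "www.stackoverflow.com",
    "this is a question on www.stackoverflow.com",
    "www.stackoverflow.com is great",
]

# For each sample: does the anchored pattern match, and what does the
# unanchored pattern find inside the string?
results = [(bool(re.search(anchored, s)), re.findall(core, s)) for s in samples]
for sample, (is_whole_match, found) in zip(samples, results):
    print(repr(sample), is_whole_match, found)
```

Only the first string satisfies the anchored pattern, while the unanchored pattern finds the domain in all three.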
Upvotes: 0