Reputation: 865
I want to extract all the domain names from a given string using Python. I tried the code below, but I am not getting the expected output:
str = "ctcO6OgnWRAxLtu+akRCFwM asu.edu zOiV6Wo6nDnUhQkZO4XTySrTRwLMgozM9R/LyQs2r+Pb tarantino.cs.ucsb.edu,128.111.48.123 ssh-rsa 9SMF4U+qJW03Bh1"
list = re.findall(r'([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*\.)+[a-z]{2,10}', str)
print list
I want the output as:
asu.edu , tarantino.cs.ucsb.edu
but what I get is:
[('asu.', ''), ('ucsb.', '')]
What am I missing?
Upvotes: 0
Views: 1183
Reputation: 880697
In [63]: text = "ctcO6OgnWRAxLtu+akRCFwM asu.edu zOiV6Wo6nDnUhQkZO4XTySrTRwLMgozM9R/LyQs2r+Pb tarantino.cs.ucsb.edu,128.111.48.123 ssh-rsa 9SMF4U+qJW03Bh1"
In [64]: re.findall(r'(?:[a-zA-Z0-9]+\.)+[a-z]{2,10}', text)
Out[64]: ['asu.edu', 'tarantino.cs.ucsb.edu']
Use (?:...) to create a non-capturing group. When the pattern contains more than one capturing group (i.e. a sub-pattern surrounded by plain parentheses), re.findall returns a tuple for each match; with exactly one capturing group it returns only that group's text instead of the whole match. To make re.findall return the whole match as a plain string, use non-capturing groups.
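To make that rule concrete, here is a small sketch of how the return value of re.findall changes with the number of capturing groups. The sample string and the variable names are made up for illustration, and the patterns are simplified to lowercase letters only.

```python
import re

sample = "see asu.edu and cs.ucsb.edu"  # hypothetical sample text

# No capturing groups: each item is the whole match.
no_groups = re.findall(r'(?:[a-z]+\.)+[a-z]{2,10}', sample)

# One capturing group: each item is that group's text only.
# A repeated group keeps just its last repetition.
one_group = re.findall(r'([a-z]+\.)+[a-z]{2,10}', sample)

# Two or more capturing groups: each item is a tuple, one entry per group.
two_groups = re.findall(r'(([a-z]+\.)+[a-z]{2,10})', sample)

print(no_groups)   # ['asu.edu', 'cs.ucsb.edu']
print(one_group)   # ['asu.', 'ucsb.']
print(two_groups)  # [('asu.edu', 'asu.'), ('cs.ucsb.edu', 'ucsb.')]
```

The middle case shows why the pattern in the question returned fragments such as 'asu.' rather than the full names.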
For the text you posted, the sub-pattern (-[a-zA-Z0-9]+)* is unnecessary. The only literal - in text is in ssh-rsa, which is not followed by a dot, so (-[a-zA-Z0-9]+)* never contributes to a match in text. Of course, you could add (?:-[a-zA-Z0-9]+)* to the pattern if you wish (note the use of the non-capturing group (?:...)), but that part of the pattern is not exercised by the text you posted. It would allow you to match names with hyphens, however:
In [73]: re.findall(r'(?:[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*\.)+[a-z]{2,10}', 'asu-psu.edu but not initial hyphens like -psu-asu.edu')
Out[73]: ['asu-psu.edu', 'psu-asu.edu']
And as Aprillion noted:
In [74]: re.findall(r'(?:[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*\.)+[a-z]{2,10}', text)
Out[74]: ['asu.edu', 'tarantino.cs.ucsb.edu']
See regex101 for an explanation of the pattern (?:[a-zA-Z0-9]+\.)+[a-z]{2,10}
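If it helps, the same pattern can also be written with inline comments using the re.VERBOSE flag. This is only a sketch of one way to annotate it; the name domain_re is an assumption, not something from the question.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows # comments.
domain_re = re.compile(r'''
    (?:                 # non-capturing group: one label plus its dot
        [a-zA-Z0-9]+    # label: one or more letters or digits
        \.              # literal dot after the label
    )+                  # one or more such labels, e.g. "tarantino.cs.ucsb."
    [a-z]{2,10}         # final label: 2 to 10 lowercase letters, e.g. "edu"
''', re.VERBOSE)

text = "ctcO6OgnWRAxLtu+akRCFwM asu.edu zOiV6Wo6nDnUhQkZO4XTySrTRwLMgozM9R/LyQs2r+Pb tarantino.cs.ucsb.edu,128.111.48.123 ssh-rsa 9SMF4U+qJW03Bh1"

domains = domain_re.findall(text)
print(domains)  # ['asu.edu', 'tarantino.cs.ucsb.edu']
```

The IP address 128.111.48.123 is skipped because the final label must consist of letters.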
Upvotes: 0
Reputation: 1077
This should work:
import re
my_str = "ctcO6OgnWRAxLtu+akRCFwM asu.edu zOiV6Wo6nDnUhQkZO4XTySrTRwLMgozM9R/LyQs2r+Pb tarantino.cs.ucsb.edu,128.111.48.123 ssh-rsa 9SMF4U+qJW03Bh1"
my_list = re.findall(r'(([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*\.)+[a-z]{2,10})', my_str)
print([i[0] for i in my_list])  # i[0] is the outermost group, i.e. the whole match
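As an alternative to wrapping the whole pattern in an extra group, a possible variant is to use re.finditer and take group(0) of each match, which is always the full matched text no matter how many capturing groups the pattern has. This is a sketch reusing the question's original pattern unchanged; the names pattern and domains are introduced here for illustration.

```python
import re

my_str = "ctcO6OgnWRAxLtu+akRCFwM asu.edu zOiV6Wo6nDnUhQkZO4XTySrTRwLMgozM9R/LyQs2r+Pb tarantino.cs.ucsb.edu,128.111.48.123 ssh-rsa 9SMF4U+qJW03Bh1"

# The question's pattern, capturing groups and all.
pattern = r'([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*\.)+[a-z]{2,10}'

# finditer yields match objects; group(0) is the entire match.
domains = [m.group(0) for m in re.finditer(pattern, my_str)]
print(domains)  # ['asu.edu', 'tarantino.cs.ucsb.edu']
```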
As Gavin pointed out, you shouldn't use str and list as variable names because they are built-in types in Python.
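A short sketch of what can go wrong: once a name such as str is rebound to a string, the built-in type is hidden in that scope, so calling it fails. The function name shadowing_demo is hypothetical and exists only to keep the shadowing local.

```python
def shadowing_demo():
    # Rebinding str inside this function hides the built-in type here.
    str = "asu.edu"
    try:
        str(42)  # now tries to call a string object
    except TypeError as exc:
        return type(exc).__name__
    return None

result = shadowing_demo()
print(result)   # TypeError
print(str(42))  # the built-in is unaffected outside the function: 42
```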
Upvotes: 1