Reputation: 865

Python domain name list regex

I want to get all the domain names in a given string using Python. I have tried the code below, but I am not getting the expected output:

str = "ctcO6OgnWRAxLtu+akRCFwM asu.edu zOiV6Wo6nDnUhQkZO4XTySrTRwLMgozM9R/LyQs2r+Pb tarantino.cs.ucsb.edu,128.111.48.123 ssh-rsa 9SMF4U+qJW03Bh1"
list = re.findall(r'([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*\.)+[a-z]{2,10}', str)
print list

I want the output as:

asu.edu , tarantino.cs.ucsb.edu

but what I get is:

[('asu.', ''), ('ucsb.', '')]

What am I missing?

Upvotes: 0

Answers (2)

unutbu

Reputation: 880697

In [63]: text = "ctcO6OgnWRAxLtu+akRCFwM asu.edu zOiV6Wo6nDnUhQkZO4XTySrTRwLMgozM9R/LyQs2r+Pb tarantino.cs.ucsb.edu,128.111.48.123 ssh-rsa 9SMF4U+qJW03Bh1"

In [64]: re.findall(r'(?:[a-zA-Z0-9]+\.)+[a-z]{2,10}', text)
Out[64]: ['asu.edu', 'tarantino.cs.ucsb.edu']

Use (?:...) to create a non-capturing group. When the pattern contains more than one capturing group (i.e. a subpattern surrounded by plain parentheses), re.findall returns a tuple of the groups for each match instead of the full matched text. To prevent re.findall from returning a list of tuples, use non-capturing groups.
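To make the effect of the groups concrete, here is a small sketch comparing the same pattern with zero, one, and two capturing groups. The shortened `sample` string and the variable names are made up for illustration; only `re.findall` itself comes from the answer.

```python
import re

# Shortened sample (an assumption for illustration, not the full text from the question).
sample = "asu.edu tarantino.cs.ucsb.edu"

# No capturing groups: each result is the whole match.
no_groups = re.findall(r'(?:[a-zA-Z0-9]+\.)+[a-z]{2,10}', sample)

# One capturing group: each result is only that group's text
# (for a repeated group, the text of its last repetition).
one_group = re.findall(r'([a-zA-Z0-9]+\.)+[a-z]{2,10}', sample)

# Two capturing groups: each result is a tuple with one entry per group.
# A group that did not take part in the match shows up as ''.
two_groups = re.findall(r'([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*\.)+[a-z]{2,10}', sample)

print(no_groups)   # ['asu.edu', 'tarantino.cs.ucsb.edu']
print(one_group)   # ['asu.', 'ucsb.']
print(two_groups)  # [('asu.', ''), ('ucsb.', '')]
```

The last line reproduces the output from the question, which is why switching every group to (?:...) fixes it.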
For the text you posted, the subpattern (-[a-zA-Z0-9]+)* is unnecessary. None of the domain names in text contains a hyphen (the only literal - is in ssh-rsa, which is not followed by a dot and so is never part of a match), so (-[a-zA-Z0-9]+)* never contributes anything to a match. Of course, you could add (?:-[a-zA-Z0-9]+)* to the pattern if you wish (note the use of the non-capturing group (?:...)), but that part of the pattern is not exercised by the text you posted. It would allow you to match names with hyphens, however:
```
In [73]: re.findall(r'(?:[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*\.)+[a-z]{2,10}', 'asu-psu.edu but not initial hyphens like -psu-asu.edu')
Out[73]: ['asu-psu.edu', 'psu-asu.edu']
```
And as Aprillion noted, the hyphen-aware pattern still works on the text you posted:
```
In [74]: re.findall(r'(?:[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*\.)+[a-z]{2,10}', text)
Out[74]: ['asu.edu', 'tarantino.cs.ucsb.edu']
```
See regex101 for an explanation of the pattern (?:[a-zA-Z0-9]+\.)+[a-z]{2,10}

Upvotes: 0

tjohnson

Reputation: 1077

This should work:

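import re
my_str = "ctcO6OgnWRAxLtu+akRCFwM asu.edu zOiV6Wo6nDnUhQkZO4XTySrTRwLMgozM9R/LyQs2r+Pb tarantino.cs.ucsb.edu,128.111.48.123 ssh-rsa 9SMF4U+qJW03Bh1"
# The extra outer parentheses make the whole domain name the first group.
my_list = re.findall(r'(([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*\.)+[a-z]{2,10})', my_str)
# Each item is a tuple of groups; i[0] is the outer group, i.e. the full match.
# print(...) with parentheses runs under both Python 2 and Python 3.
print([i[0] for i in my_list])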
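
If you would rather not wrap the pattern in an extra group, one possible alternative (not part of the original answer) is re.finditer, which yields match objects whose group(0) is always the entire match. A minimal sketch, using a shortened version of the string and the made-up names `pattern` and `domains`:

```python
import re

# Shortened sample string (an assumption for illustration).
my_str = "asu.edu zOiV6 tarantino.cs.ucsb.edu,128.111.48.123 ssh-rsa"

# The question's original pattern, capturing groups left untouched.
pattern = re.compile(r'([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*\.)+[a-z]{2,10}')

# group(0) is the entire match, no matter how many groups the pattern has.
domains = [m.group(0) for m in pattern.finditer(my_str)]
print(domains)  # ['asu.edu', 'tarantino.cs.ucsb.edu']
```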

As Gavin pointed out, you shouldn't use str and list as variable names because they are built-in types in Python.
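
A quick sketch of why that matters, written for Python 3: once a name like str is rebound, the built-in is no longer reachable under that name in that scope. The function name `demo_shadowing` is made up for illustration, and the shadowing is kept inside the function so the rest of the program is unaffected.

```python
def demo_shadowing():
    # Rebinding str here hides the built-in type for the rest of this function.
    str = "asu.edu"
    try:
        # This now tries to call a string object, not the built-in type.
        str(42)
    except TypeError as exc:
        return type(exc).__name__
    return "no error"

print(demo_shadowing())  # TypeError
print(str(42))           # the built-in is intact outside the function: 42
```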

Upvotes: 1
