Reputation: 31

Regular Expression in Python

I'm trying to build a list of domain names from an Enom API call. I get back a lot of information and need to locate the lines related to domain names, and then join them together.

The string that comes back from Enom looks somewhat like this:

SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1

I'd like to build a list from that which looks like this:

[domain1.com, domain2.org, domain3.co.uk, domain4.net]

To find the different domain name components I've tried the following (where "enom" is the string above) but have only been able to get the SLD and TLD matches.

re.findall("^.*(SLD|TLD).*$", enom, re.M)

Upvotes: 2

Answers (9)

mata

Reputation: 69082

You have a capturing group in your expression. re.findall documentation says:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

That's why only the content of the capturing group is returned.
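A minimal sketch of that behaviour, assuming a shortened, made-up `sample` string rather than the full Enom response: with a capturing group, `re.findall` hands back only the group, while a non-capturing `(?:...)` group returns the whole match.

```python
import re

# Hypothetical, shortened stand-in for the Enom response
sample = "SLD1=domain1\nTLD1=com\nTLDOverride=1"

# Capturing group: findall returns only what the group matched
with_group = re.findall(r"^.*(SLD|TLD).*$", sample, re.M)

# Non-capturing group: findall returns the whole matched line
without_group = re.findall(r"^.*(?:SLD|TLD).*$", sample, re.M)

print(with_group)     # ['SLD', 'TLD', 'TLD']
print(without_group)  # ['SLD1=domain1', 'TLD1=com', 'TLDOverride=1']
```

Note that the non-capturing version still picks up the TLDOverride line, so the pattern needs to be tightened as well.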

Try:

re.findall(r"^.*((?:SLD|TLD)\d*)=(.*)$", enom, re.M)

This would return a list of tuples:

[('SLD1', 'domain1'), ('TLD1', 'com'), ('SLD2', 'domain2'), ('TLD2', 'org'), ('SLD3', 'domain3'), ('TLD4', 'co.uk'), ('SLD5', 'domain4'), ('TLD5', 'net')]

Combining SLDs and TLDs is then up to you.
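One possible way to do that combining step, as a sketch only: it assumes each TLD line belongs to the most recent SLD line (pairing by order rather than by number, since the numbers in the sample do not always match), and it uses a slightly tightened pattern. The names `pairs`, `sld` and `domains` are illustrative.

```python
import re

enom = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""

# List of (key, value) tuples; TLDOverride has no digits, so \d+ skips it
pairs = re.findall(r"^((?:SLD|TLD)\d+)=(.*)$", enom, re.M)

domains = []
sld = None
for key, value in pairs:
    if key.startswith("SLD"):
        sld = value                        # remember the pending second-level domain
    elif sld is not None:
        domains.append(sld + "." + value)  # a TLD line closes the pending SLD
        sld = None

print(domains)  # ['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']
```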

Upvotes: 4

zenpoy

Reputation: 20136

Edit: Every time I see a question asking for a regular expression solution, I have this bizarre urge to try to solve it without regular expressions. Most of the time that is more efficient than using a regex, but I encourage the OP to test which of the solutions is most efficient.

Here is the naive approach:

a = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""

b = a.split("\n")
c = [x.split("=")[1] for x in b if x != 'TLDOverride=1']
for x in range(0,len(c),2):
    print(".".join(c[x:x+2]))

>> domain1.com
>> domain2.org
>> domain3.co.uk
>> domain4.net
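To actually test which approach is faster, something along these lines could be used. It is a rough sketch with `timeit`; the function names `with_split` and `with_regex` are made up here, and the timings will depend on the machine and on the size of the real Enom response.

```python
import re
import timeit

enom = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""


def with_split(text):
    # Plain string methods: drop TLDOverride lines, keep the value after "="
    values = [line.split("=", 1)[1] for line in text.split("\n")
              if not line.startswith("TLDOverride")]
    return [".".join(values[i:i + 2]) for i in range(0, len(values), 2)]


PATTERN = re.compile(r"^[ST]LD\d+=(.*)$", re.M)


def with_regex(text):
    # Precompiled regex: \d+ keeps TLDOverride from matching
    values = PATTERN.findall(text)
    return [".".join(values[i:i + 2]) for i in range(0, len(values), 2)]


# Both approaches should agree before their timings are compared
assert with_split(enom) == with_regex(enom)

for func in (with_split, with_regex):
    seconds = timeit.timeit(lambda: func(enom), number=10000)
    print(func.__name__, round(seconds, 4))
```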

Upvotes: 6

kirelagin

Reputation: 13626

I'm not sure why you are talking about regular expressions. I mean, why don't you just run a for loop?

A famous quote seems to be appropriate here:

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

domains = []
components = []
for line in enom.split('\n'):
  k,v = line.split('=')
  if k == 'TLDOverride':
    continue
  components.append(v)
  if k.startswith('TLD'):
    domains.append('.'.join(components))
    components = []

P.S. I'm not sure what this TLDOverride is, so the code just ignores it.

Upvotes: 3

l4mpi

Reputation: 5149

As some other answers already said, there's no need to use a regular expression here. A simple split and some filtering will do nicely:

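lines = data.split("\n")  # assuming data contains your input string
# Skip the TLDOverride lines, which would otherwise match the "TLD" prefix
sld, tld = [[x.split("=")[1] for x in lines if x[:3] == t and not x.startswith("TLDOverride")] for t in ("SLD", "TLD")]
result = [x + "." + y for x, y in zip(sld, tld)]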

Upvotes: 1

Andrei Kaigorodov

Reputation: 2165

Just for fun, map -> filter -> map:

input = """
SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
"""

split_lines = list(map(lambda x: x.split("="), input.split()))
slds = filter(lambda x: x[1][0].startswith('SLD'), enumerate(split_lines))
print(list(map(lambda x: '.'.join([x[1][1], split_lines[x[0] + 1][1]]), slds)))

>>> ['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']

Upvotes: 2

rh0dium

Reputation: 7052

You need to use multiline regex for this. This is similar to this post.

data = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""

import re

# \d+ allows multi-digit indexes; [\w.]+ lets TLDs such as co.uk match in full
domain_seq = re.compile(r"SLD\d+=(\w+)\nTLD\d+=([\w.]+)", re.M)
for item in domain_seq.finditer(data):
    domain, tld = item.group(1), item.group(2)
    print("%s.%s" % (domain, tld))

Upvotes: 1

georg

Reputation: 215039

This appears to do what you want:

domains = re.findall(r'SLD\d+=(.+)', re.sub(r'\nTLD\d+=', '.', enom))

It assumes that the lines are sorted and that an SLD always comes before its TLD. If that might not be the case, try this slightly more verbose code without regexes, which pairs each SLD with the TLD that has the same number:

d = dict(x.split('=') for x in enom.strip().splitlines())

domains = [
    d[key] + '.' + d.get('T' + key[1:], '') 
    for key in d if key.startswith('SLD')
]

Upvotes: 1

Jon Clements

Reputation: 142236

Here's one way:

import re
print(list(map('.'.join, zip(*[iter(re.findall(r'^(?:S|T)LD\d+=(.*)$', text, re.M))]*2))))
# ['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']
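The `zip(*[iter(...)]*2)` part is the least obvious bit of that one-liner. A small illustration of the idiom on its own, using a made-up `values` list in place of the `re.findall` result:

```python
# Hypothetical flat list, standing in for what re.findall returns
values = ["domain1", "com", "domain2", "org"]

it = iter(values)
# [it] * 2 holds the same iterator twice, so zip(it, it) pulls two
# consecutive items from it for every tuple it builds
pairs = list(zip(it, it))
joined = [".".join(p) for p in pairs]

print(pairs)   # [('domain1', 'com'), ('domain2', 'org')]
print(joined)  # ['domain1.com', 'domain2.org']
```

If the list has an odd number of items, `zip` silently drops the last one.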

Upvotes: 2

Qiang Jin

Reputation: 4467

This works for your example:

>>> sld_list = re.findall("^.*SLD[0-9]*?=(.*?)$", enom, re.M)
>>> tld_list = re.findall("^.*TLD[0-9]*?=(.*?)$", enom, re.M)
>>> list(map(lambda x: x[0] + '.' + x[1], zip(sld_list, tld_list)))
['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']

Upvotes: 3
