abhinav singh

Reputation: 1104

Sorting the unique values from regex match in python

I am trying to parse a log file to extract email addresses. I can match each address with a regular expression and print it, but the log file contains a couple of duplicate addresses. How can I remove the duplicates and print only the unique email addresses that match the pattern?

Here is the code I have written so far:

import sys
import re

file = open('/Users/me/Desktop/test.txt', 'r')
temp =[]
for line in file.readlines():
    if '->' in line:
        temp = line.split('->')
    elif '=>' in line:
        temp = line.split('=>')

    if temp:
        #temp[1].strip()
        pattern = re.match('^\x20\w{1,}@\w{1,}\.\w{2,3}\x20?', str(temp[1]), re.M)
        if pattern is not None:
            print pattern.group()

        else:
            print "nono"

Here is my example log file that I am trying to parse:

Feb 24 00:00:23 smtp1.mail.net exim[5660]: 2014-02-24 00:00:23 1Wuniq-mail-idSo-Fg -> [email protected] R=mail T=remote_smtp H=smtp.mail.net [000.00.34.17]

Feb 24 00:00:23 smtp1.mail.net exim[5660]: 2014-02-24 00:00:23 1Wuniq-mail-idSo-Fg -> [email protected] R=mail T=remote_smtp H=smtp.mail.net [000.00.34.17]

Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => [email protected] R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]

Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => [email protected] R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]

Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => [email protected] R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]

Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => [email protected] R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]

Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h Completed

Also, I am curious if I can improve my program or the regex. Any suggestion would be very helpful.

Thanks in advance.

Upvotes: 2

Views: 2150

Answers (3)

mhawke

Reputation: 87084

Some of the duplicates are due to a bug in your code where you do not reset temp when processing each line. A line that does not contain either -> or => and which is preceded by a line that does contain either of those strings will trigger the if temp: test, and output the email address from the previous line if there was one.

That can be fixed by jumping back to the start of the loop with continue when the line contains neither -> nor =>.
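
The stale-variable effect described above can be reproduced with a minimal sketch. The helper name `second_fields`, its `reset_each_line` flag, and the sample address are made up for illustration; the sketch resets `temp` instead of using `continue` only so that both behaviours can be compared side by side.

```python
def second_fields(lines, reset_each_line):
    """Collect the text after '->' for every line that passes the `if temp:` test."""
    collected = []
    temp = []
    for line in lines:
        if reset_each_line:
            temp = []  # forget the split left over from the previous line
        if '->' in line:
            temp = line.split('->')
        if temp:
            collected.append(temp[1].strip())
    return collected


# Hypothetical lines: one delivery line followed by a line without '->'.
lines = ["id1 -> bob@example.org R=mail", "id1 Completed"]

print(second_fields(lines, reset_each_line=False))  # the delivery is reported twice
print(second_fields(lines, reset_each_line=True))   # the delivery is reported once
```

Without the reset, the "Completed" line reuses the split from the line before it, which is where the extra duplicate comes from.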

For the other genuine duplicates that occur because the same email address appears in multiple lines, you can filter those out with a set.

import re

addresses = set()
# Compile once; the raw string passes the backslashes through to the regex engine.
pattern = re.compile(r'^\x20\w{1,}@\w{1,}\.\w{2,3}\x20?')

with open('/Users/me/Desktop/test.txt', 'r') as f:
    for line in f:
        if '->' in line:
            temp = line.split('->')
        elif '=>' in line:
            temp = line.split('=>')
        else:
            # neither '=>' nor '->' present in the line
            continue

        match = pattern.match(temp[1])
        if match is not None:
            # strip the surrounding spaces that the pattern also matches
            addresses.add(match.group().strip())
        else:
            print("nono")

for address in sorted(addresses):
    print(address)

The addresses are stored in a set to remove duplicates. Then they are sorted and printed. Note also the use of the with statement to open the file within a context manager. This guarantees that the file will always be closed.

Also, as you will be applying the same regex pattern many times, it is worth compiling it ahead of time for better efficiency.

With a properly written regex pattern your code can be greatly simplified:

import re

addresses = set()
pattern = re.compile(r'[-=]> +(\w{1,}@\w{1,}\.\w{2,3})')

with open('test.txt', 'r') as f:
    for line in f:
        match = pattern.search(line)
        if match:
            addresses.add(match.group(1))

for address in sorted(addresses):
    print(address)
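
As a quick way to try the simplified pattern without touching a file, here is a minimal sketch that runs it over a few in-memory lines. The function name `unique_addresses` and the sample addresses are hypothetical; only the line layout follows the log in the question.

```python
import re

# Same pattern as above: '->' or '=>', one or more spaces, then the address.
PATTERN = re.compile(r'[-=]> +(\w{1,}@\w{1,}\.\w{2,3})')


def unique_addresses(lines):
    """Return the sorted, de-duplicated addresses found in an iterable of lines."""
    found = set()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            found.add(match.group(1))
    return sorted(found)


# Hypothetical sample lines shaped like the log in the question.
sample = [
    "2014-02-24 00:00:23 1Wuniq-mail-idSo-Fg -> bob@example.org R=mail T=remote_smtp",
    "2014-02-24 00:00:23 1Wuniq-mail-idSo-Fg -> bob@example.org R=mail T=remote_smtp",
    "2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => alice@example.com R=mail T=pop_mail_net",
    "2014-02-24 00:00:23 1Wuniq-mail-idSm-1h Completed",
]

for address in unique_addresses(sample):
    print(address)
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the list.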

Upvotes: 1

MaxU - stand with Ukraine

Reputation: 210852

As danidee (who was first) said, a set would do the trick.

Try this:

from __future__ import print_function

import re

with open('test.txt') as f:
    data = f.read().splitlines()

emails = set(re.sub(r'^.*\s+(\w+\@[^\s]*?)\s+.*', r'\1', line) for line in data if '@' in line)

print('\n'.join(emails) if emails else 'nono')

Output:

[email protected]
[email protected]
[email protected]
[email protected]

PS: you may want to use a proper email regex, because the check I used is very primitive.
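
One possible direction for a less primitive check, as a sketch only: the pattern below accepts dots, hyphens, plus signs and underscores in the local part, and domains with several labels. It reflects an assumption about which addresses may appear in the log and is not a full RFC 5322 validator; the names `EMAIL` and `extract_emails` are illustrative.

```python
import re

# Local part: word characters plus '.', '+', '-'.
# Domain: one label followed by one or more '.label' parts.
EMAIL = re.compile(r'[\w.+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+')


def extract_emails(lines):
    """Return the set of address-like strings found in an iterable of lines."""
    emails = set()
    for line in lines:
        # The pattern has no capturing group, so findall returns whole matches.
        emails.update(EMAIL.findall(line))
    return emails


# Hypothetical sample lines; the addresses are made up.
sample = [
    "x -> first.last+tag@mail.example.co.uk R=mail",
    "no address here",
    "y => bob@example.org T=pop",
]
print(sorted(extract_emails(sample)))
```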

Upvotes: 2

Kasravnd

Reputation: 107297

You can use a set to remember the addresses you have already printed, and print a matched email only if it is not yet in that set:

import sys
import re

file = open('/Users/me/Desktop/test.txt', 'r')
temp =[]
seen = set()
for line in file.readlines():
    if '->' in line:
        temp = line.split('->')
    elif '=>' in line:
        temp = line.split('=>')

    if temp:
        #temp[1].strip()
        pattern = re.match('^\x20\w{1,}@\w{1,}\.\w{2,3}\x20?', str(temp[1]), re.M)
        if pattern is not None:
            matched = pattern.group()
            if matched not in seen:
                # remember the address so later duplicates are skipped
                seen.add(matched)
                print matched

        else:
            print "nono"
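
The seen-set idea can also be sketched as a small helper that keeps addresses in first-seen order instead of sorting them. The name `unique_in_order` and the sample addresses are hypothetical.

```python
def unique_in_order(items):
    """Yield each item the first time it appears, preserving input order."""
    seen = set()
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item


# Hypothetical matches in the order they were found in the log.
matches = ["bob@example.org", "alice@example.com", "bob@example.org"]

for address in unique_in_order(matches):
    print(address)
```

Wrapping the result in `sorted(...)` gives alphabetical output when log order does not matter.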

Upvotes: 1
