Reputation: 1104
I am trying to parse a log file to extract email addresses. I can match each email with a regular expression and print it, but I noticed that the log file contains several duplicate emails. How can I remove the duplicates and print only the unique email addresses that match the pattern?
Here is the code I have written so far:
import sys
import re
file = open('/Users/me/Desktop/test.txt', 'r')
temp =[]
for line in file.readlines():
    if '->' in line:
        temp = line.split('->')
    elif '=>' in line:
        temp = line.split('=>')
    if temp:
        #temp[1].strip()
        pattern = re.match('^\x20\w{1,}@\w{1,}\.\w{2,3}\x20?', str(temp[1]), re.M)
        if pattern is not None:
            print pattern.group()
        else:
            print "nono"
Here is my example log file that I am trying to parse:
Feb 24 00:00:23 smtp1.mail.net exim[5660]: 2014-02-24 00:00:23 1Wuniq-mail-idSo-Fg -> [email protected] R=mail T=remote_smtp H=smtp.mail.net [000.00.34.17]
Feb 24 00:00:23 smtp1.mail.net exim[5660]: 2014-02-24 00:00:23 1Wuniq-mail-idSo-Fg -> [email protected] R=mail T=remote_smtp H=smtp.mail.net [000.00.34.17]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => [email protected] R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => [email protected] R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => [email protected] R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h => [email protected] R=mail T=pop_mail_net H=mta.mail.net [000.00.34.6]
Feb 24 00:00:23 smtp1.mail.net exim[5661]: 2014-02-24 00:00:23 1Wuniq-mail-idSm-1h Completed
Also, I am curious whether I can improve my program or the regex. Any suggestions would be very helpful.
Thanks in advance.
Upvotes: 2
Views: 2150
Reputation: 87084
Some of the duplicates are due to a bug in your code where you do not reset temp when processing each line. A line that does not contain either -> or => and which is preceded by a line that does contain either of those strings will trigger the if temp: test, and output the email address from the previous line if there was one.
That can be fixed by jumping back to the start of the loop with continue when the line contains neither -> nor =>.
For the other genuine duplicates that occur because the same email address appears in multiple lines, you can filter those out with a set.
import sys
import re
addresses = set()
# Raw string so the backslash escapes reach the regex engine unchanged
pattern = re.compile(r'^\x20\w{1,}@\w{1,}\.\w{2,3}\x20?')
with open('/Users/me/Desktop/test.txt', 'r') as f:
    for line in f:
        if '->' in line:
            temp = line.split('->')
        elif '=>' in line:
            temp = line.split('=>')
        else:
            # neither '=>' nor '->' present in the line
            continue
        match = pattern.match(temp[1])
        if match is not None:
            addresses.add(match.group())
        else:
            print("nono")
for address in sorted(addresses):
    print(address)
The addresses are stored in a set to remove duplicates. Then they are sorted and printed. Note also the use of the with statement to open the file within a context manager. This guarantees that the file will always be closed.
Also, as you will be applying the same regex pattern many times, it is worth compiling it ahead of time for better efficiency.
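The stale temp problem described at the top of this answer is easy to reproduce in isolation. The following is only a minimal sketch, not the original program: the sample lines and the address alice@example.com are made up for illustration, and extract_buggy and extract_fixed are hypothetical helper names.

```python
import re

# Same pattern as above, written as a raw string.
PATTERN = re.compile(r'^\x20\w{1,}@\w{1,}\.\w{2,3}\x20?')

# Made-up sample lines; the address is hypothetical.
LINES = [
    "id1 => alice@example.com R=mail",
    "id1 Completed",
]

def extract_buggy(lines):
    """Mimics the question's loop: temp is never reset between lines."""
    found = []
    temp = []
    for line in lines:
        if '->' in line:
            temp = line.split('->')
        elif '=>' in line:
            temp = line.split('=>')
        if temp:  # still true on the "Completed" line
            match = PATTERN.match(temp[1])
            if match is not None:
                found.append(match.group().strip())
    return found

def extract_fixed(lines):
    """Skips lines that contain neither arrow, so no stale data is reused."""
    found = []
    for line in lines:
        if '->' in line:
            temp = line.split('->')
        elif '=>' in line:
            temp = line.split('=>')
        else:
            continue
        match = PATTERN.match(temp[1])
        if match is not None:
            found.append(match.group().strip())
    return found

print(extract_buggy(LINES))  # the address appears twice
print(extract_fixed(LINES))  # the address appears once
```

The "Completed" line carries no address of its own, yet the buggy version reports one because temp still holds the split result of the previous line.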
With a properly written regex pattern your code can be greatly simplified:
import re
addresses = set()
pattern = re.compile(r'[-=]> +(\w{1,}@\w{1,}\.\w{2,3})')
with open('test.txt', 'r') as f:
    for line in f:
        match = pattern.search(line)
        if match:
            addresses.add(match.group(1))
for address in sorted(addresses):
    print(address)
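As a quick check of what the capture group buys you, the sketch below runs both patterns against a single made-up line (the address bob@example.org is hypothetical). With the anchored pattern the whole match includes the surrounding spaces, whereas group 1 of the simplified pattern is the bare address.

```python
import re

# Hypothetical line in the same shape as the log excerpt.
line = "2014-02-24 00:00:23 1Wuniq-mail-idSo-Fg -> bob@example.org R=mail T=remote_smtp"

# Anchored approach: split on the arrow, then match at the start of the remainder.
anchored = re.compile(r'^\x20\w{1,}@\w{1,}\.\w{2,3}\x20?')
whole = anchored.match(line.split('->')[1]).group()

# Simplified approach: search the full line and capture only the address.
simplified = re.compile(r'[-=]> +(\w{1,}@\w{1,}\.\w{2,3})')
captured = simplified.search(line).group(1)

print(repr(whole))     # includes the leading and trailing space
print(repr(captured))  # the address alone
```

If you keep the anchored pattern, calling .strip() on the match before adding it to the set avoids storing the padding.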
Upvotes: 1
Reputation: 210852
As danidee said (he was first), a set will do the trick.
Try this:
from __future__ import print_function
import re
with open('test.txt') as f:
    data = f.read().splitlines()
emails = set(re.sub(r'^.*\s+(\w+\@[^\s]*?)\s+.*', r'\1', line) for line in data if '@' in line)
if emails:
    print('\n'.join(emails))
else:
    print('nono')
Output:
[email protected]
[email protected]
[email protected]
[email protected]
PS: you may want to use a proper email regex check, because I used a very primitive one.
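As one possible direction for that stricter check, here is a sketch of a pattern that also accepts dots, plus signs and hyphens in the local part, and domains with several labels. What counts as "proper" is an assumption on my part: this is not a full RFC 5322 validator. EMAIL_RE and find_emails are hypothetical names, and the sample addresses are invented.

```python
import re

# A pragmatic pattern, not a complete RFC 5322 validator (assumption).
EMAIL_RE = re.compile(r'[\w.+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+')

def find_emails(lines):
    """Return the set of unique addresses found in an iterable of lines."""
    found = set()
    for line in lines:
        found.update(EMAIL_RE.findall(line))
    return found

# Invented sample lines for illustration.
sample = [
    "id1 -> first.last@mail.example.com R=mail",
    "id1 -> first.last@mail.example.com R=mail",
    "id2 => user+tag@example.org R=mail",
    "id2 Completed",
]

for address in sorted(find_emails(sample)):
    print(address)
```

Because the group in the pattern is non-capturing, findall returns the full matched address rather than just the last domain label.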
Upvotes: 2
Reputation: 107297
You can use a set container to keep track of the unique results. Each time you are about to print a matched email, check whether it is already in the set, and print it only if it is not:
import sys
import re
file = open('/Users/me/Desktop/test.txt', 'r')
temp =[]
seen = set()
for line in file.readlines():
    if '->' in line:
        temp = line.split('->')
    elif '=>' in line:
        temp = line.split('=>')
    if temp:
        #temp[1].strip()
        pattern = re.match('^\x20\w{1,}@\w{1,}\.\w{2,3}\x20?', str(temp[1]), re.M)
        if pattern is not None:
            matched = pattern.group()
            if matched not in seen:
                # remember the address so later repeats are skipped
                seen.add(matched)
                print matched
        else:
            print "nono"
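The seen-set idea can also be packaged as a generator that yields each address the first time it appears, which keeps the order of the log. This is a sketch under assumptions: unique_emails is a hypothetical helper, the pattern simply looks for an arrow followed by a simple address, and the sample lines are made up.

```python
import re

# Arrow ('->' or '=>'), spaces, then a simple address captured in group 1.
ARROW_EMAIL = re.compile(r'[-=]> +(\w+@\w+\.\w{2,3})')

def unique_emails(lines):
    """Yield each matched address once, in order of first appearance."""
    seen = set()
    for line in lines:
        match = ARROW_EMAIL.search(line)
        if match is None:
            continue
        address = match.group(1)
        if address not in seen:
            seen.add(address)  # remember it so later repeats are skipped
            yield address

# Made-up sample lines; the addresses are hypothetical.
sample = [
    "id1 -> carol@example.com R=mail",
    "id1 -> carol@example.com R=mail",
    "id2 => dave@example.net R=mail",
    "id2 Completed",
]

for address in unique_emails(sample):
    print(address)
```

Since the function accepts any iterable of lines, an open file object can be passed in place of the sample list.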
Upvotes: 1