Reputation: 45
This might be a silly question, but I'm just trying to learn!
I'm trying to build a simple email search tool to learn more about Python. I'm modifying some open source code to parse email addresses:
emails = re.findall(r'([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*)', html)
Then I'm writing the results into a spreadsheet using the CSV module.
Since I'd like to keep the domain extension open to almost anything, my results include image file names that look like email addresses:
example: [email protected]
How can I exclude matches ending in "png" from the re.findall results?
Code:
def scrape(self, page):
    try:
        request = urllib2.Request(page.url.encode("utf8"))
        html = urllib2.urlopen(request).read()
    except Exception, e:
        return
    emails = re.findall(r'([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*)', html)
    for email in emails:
        if email not in self.emails:  # if not a duplicate
            self.csvwriter.writerow([page.title.encode('utf8'), page.url.encode("utf8"), email])
            self.emails.append(email)
Upvotes: 3
Views: 7088
Reputation: 54213
Lots of ways to do this, but my favorite is:
pat = re.compile(r'''
[A-Za-z0-9\.\+_-]+ # one or more letters, digits, or . + _ -
@[A-Za-z0-9\._-]+ # literal @, then one or more letters, digits, or . _ -
\.png # if png, DON'T CAPTURE
|([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*)
# if not png, CAPTURE''', flags=re.X)
Since regex alternatives are tried left to right, the left side of the | gets the first chance to match. If the string ends in .png, the left side consumes it but does NOT capture it. If it DOESN'T end in .png, the left side fails, and the right side of the | consumes it and DOES capture it. For a more in-depth conversation of this trick, see here. To use it, do:
matches = filter(None,pat.findall(html))
Any string matched by the left side (e.g. all the png files, which are matched but are NOT part of a capturing group) will show up as an empty string in your findall. filter(None, iterable) removes all the empty strings from your iterable, leaving you with only the data you want.
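The alternation trick above can be sketched end to end as follows. The sample text and the addresses in it are made up for illustration. A list comprehension stands in for filter(None, ...) because in Python 3 filter returns an iterator rather than a list.

```python
import re

# Left branch: consume anything that ends in .png, with no capturing group.
# Right branch: capture everything else that looks like an email address.
pat = re.compile(r'''
    [A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.png
    |
    ([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*)
''', flags=re.X)

# Hypothetical page text, not taken from the question.
html = "contact alice@example.com, icon logo@2x.png, bob@example.org"

raw = pat.findall(html)            # png matches appear as empty strings
matches = [m for m in raw if m]    # drop the empty strings

print(raw)      # ['alice@example.com', '', 'bob@example.org']
print(matches)  # ['alice@example.com', 'bob@example.org']
```

The empty string in raw is the png match: the left branch consumed it, but the only capturing group did not take part in that match.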
Alternatively, you can filter after you grab everything:
pat = re.compile(r'''[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*''')
# same regex you have currently
matches = filter(lambda x: not x.endswith('png'), pat.findall(html))
Note that further on, you should really make self.emails a set. It doesn't seem to need to keep its ordering, and set lookup is WAY faster than list lookup. Remember to use set.add instead of list.append though.
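As a rough sketch of the set suggestion combined with filtering after the match: the helper name new_emails and the sample strings are hypothetical, and the check uses '.png' (with the dot) as a slightly stricter variant of endswith('png').

```python
import re

# Same pattern as in the question, without the outer capturing group.
EMAIL_RE = re.compile(r'[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*')


def new_emails(html, seen):
    """Return addresses found in html that are not yet in the set seen.

    Matches ending in .png are skipped. seen is updated in place.
    """
    fresh = []
    for email in EMAIL_RE.findall(html):
        if email.endswith('.png') or email in seen:
            continue
        seen.add(email)        # set.add, not list.append
        fresh.append(email)
    return fresh


seen = set()
first = new_emails("a@b.com x@2x.png a@b.com", seen)
second = new_emails("a@b.com c@d.org", seen)
print(first)   # ['a@b.com']
print(second)  # ['c@d.org']
```

The set only tracks what has been seen. If the output order matters, the returned list keeps the order in which the addresses were found.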
Upvotes: 2
Reputation: 746
I know Joran already gave you a response, but here's another way to do it with Python regex that I found cool.
There is a negative lookahead assertion, (?!...), that essentially says: "Wherever you place this assertion, if the pattern inside it matches at that point in the string, then the overall match fails at that point." It only checks what comes next and does not consume any characters.
If that was a bad explanation, the Python documentation does a much better job: https://docs.python.org/2/howto/regex.html#lookahead-assertions
Also, here is a working example:
y = r'([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.(?!png)[a-zA-Z]*)'
s = '[email protected]'
re.findall(y, s) # Will return an empty list
s2 = '[email protected]'
re.findall(y, s2) # Will return a list with s2 string
s3 = s + ' ' + s2 # Concatenates the two e-mail-formatted strings
re.findall(y, s3) # Will only return s2 string in list
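Here is a small self-contained run of the lookahead approach. The sample strings are made up for illustration (they are not the ones from the example above), and the character class is written as [a-zA-Z].

```python
import re

# (?!png) rejects a match whose final extension starts with "png".
pattern = re.compile(r'([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.(?!png)[a-zA-Z]*)')

image_like = 'logo@2x.png'        # hypothetical image file name
real = 'alice@example.com'        # hypothetical address

print(pattern.findall(image_like))               # []
print(pattern.findall(real))                     # ['alice@example.com']
print(pattern.findall(image_like + ' ' + real))  # ['alice@example.com']

# Caveat: the regex can backtrack to an earlier dot, so a longer
# png name still yields a shortened match.
print(pattern.findall('a@cdn.example.png'))      # ['a@cdn.example']
```

The last line shows a limit of this approach: the lookahead is checked only after whichever dot the regex settles on, so filtering the results afterwards may still be worth keeping as a safety net.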
Upvotes: 2
Reputation: 113988
You are already only acting inside an if, so just make the png check part of that if condition. That will be much, much easier than trying to exclude it from the regex:
if email not in self.emails and not email.endswith("png"):  # if not a duplicate
    self.csvwriter.writerow([page.title.encode('utf8'), page.url.encode("utf8"), email])
    self.emails.append(email)
Upvotes: 2