Jtuck4491

Reputation: 45

How do I exclude a string from re.findall?

This might be a silly question, but I'm just trying to learn!

I'm trying to build a simple email search tool to learn more about Python. I'm modifying some open source code to parse email addresses:

emails = re.findall(r'([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*)', html)

Then I'm writing the results into a spreadsheet using the CSV module.

Since I'd like to keep the domain extension open to almost anything, my results include image file names that look like email addresses:

example: [email protected]

How can I exclude the "png" string from the re.findall results?

Code:

def scrape(self, page):
    try:
        request = urllib2.Request(page.url.encode("utf8"))
        html = urllib2.urlopen(request).read()
    except Exception:  # could not fetch the page, so skip it
        return
    emails = re.findall(r'([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*)', html)
    for email in emails:
        if email not in self.emails:  # if not a duplicate
            self.csvwriter.writerow([page.title.encode('utf8'), page.url.encode("utf8"), email])
            self.emails.append(email)

Upvotes: 3

Views: 7088

Answers (3)

Adam Smith

Reputation: 54213

Lots of ways to do this, but my favorite is:

pat = re.compile(r'''
          [A-Za-z0-9\.\+_-]+ # 1+ letters, digits, or . + _ -
          @[A-Za-z0-9\._-]+  # literal @ followed by same
          \.png              # if png, DON'T CAPTURE
          |([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*)
                             # if not png, CAPTURE''', flags=re.X)

Since regexes are evaluated left-to-right, the engine tries the left side of the | first at each position. If the candidate ends in .png, the left side consumes it but does NOT capture it. If it DOESN'T end in .png, the left side fails, and the right side of the | consumes it and DOES capture it. For a more in-depth conversation of this trick, see here. To use it:

matches = filter(None,pat.findall(html))

Any string matched by the left side (e.g. all the png files that are matched but NOT part of a capturing group) will show up as an empty string in your findall. filter(None, iterable) removes all the empty strings from your iterable, leaving you with only the data you want.
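To make the empty-string behavior concrete, here is a minimal sketch of the same two-branch pattern run against a made-up input. The sample text and the names raw and matches are my own assumptions, not from the question, and a list comprehension stands in for filter(None, ...) so it behaves the same on Python 2 and 3.

```python
import re

# Same idea as the pattern above: the left branch consumes ".png" look-alikes
# without capturing, the right branch captures everything else.
pat = re.compile(r'''
    [A-Za-z0-9.+_-]+@[A-Za-z0-9._-]+\.png            # png: match, don't capture
    |([A-Za-z0-9.+_-]+@[A-Za-z0-9._-]+\.[a-zA-Z]*)   # anything else: capture
''', flags=re.X)

# Hypothetical sample text, not taken from the question.
html = 'logo: sprite@2x.png, contact: alice@example.com'

raw = pat.findall(html)           # one entry per match; '' where group 1 did not take part
matches = [m for m in raw if m]   # drop the empty strings

print(raw)      # ['', 'alice@example.com']
print(matches)  # ['alice@example.com']
```

The empty string in raw is the png match from the left branch; filtering it out leaves only the captured addresses.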

Alternatively, you can filter after you grab everything:

pat = re.compile(r'''[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*''')
# same regex you have currently
matches = filter(lambda x: not x.endswith('png'), pat.findall(html))

Note that further on, you should really make self.emails a set. It doesn't seem to need to keep its ordering, and set lookup is WAY faster than list lookup. Remember to use set.add instead of list.append though.
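A rough sketch of the set-based duplicate check, outside the class for brevity. The names found, seen and rows are placeholders I made up; seen plays the role of self.emails and rows stands in for the csvwriter.writerow calls.

```python
# Hypothetical list of scraped addresses, including one repeat.
found = ['alice@example.com', 'bob@example.com', 'alice@example.com']

seen = set()   # plays the role of self.emails
rows = []      # stands in for the rows written by csvwriter.writerow

for email in found:
    if email not in seen:   # membership test on a set is O(1) on average
        rows.append(email)
        seen.add(email)     # set.add, not list.append

print(rows)  # ['alice@example.com', 'bob@example.com']
```

The rows still come out in first-seen order because they are written as they are encountered; only the lookup structure changes.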

Upvotes: 2

Zhouster

Reputation: 746

I know Joran already gave you a response, but here's another way to do it with Python regex that I found cool.

There is a (?!...) construct, called a negative lookahead, that essentially says: "At the point in the string where this is placed, check whether the enclosed pattern matches. If it does, the overall match fails at that point." It does not consume any characters itself.

If that was a bad explanation, the Python documentation does a much better job: https://docs.python.org/2/howto/regex.html#lookahead-assertions

Also, here is a working example:

y = r'([A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.(?!png)[a-zA-Z]*)'
s = '[email protected]'
re.findall(y, s) # Will return an empty list

s2 = '[email protected]'
re.findall(y, s2) # Will return a list with s2 string

s3 = s + ' ' + s2 # Concatenates the two e-mail-formatted strings
re.findall(y, s3) # Will only return s2 string in list
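Here is a self-contained version of the lookahead approach that can be pasted into an interpreter. The sample strings are hypothetical stand-ins I picked for illustration, not the addresses from the original post.

```python
import re

# Same lookahead pattern, with the character class spelled [a-zA-Z].
pattern = r'([A-Za-z0-9.+_-]+@[A-Za-z0-9._-]+\.(?!png)[a-zA-Z]*)'

# Hypothetical inputs chosen for illustration.
image_like = 'sprite@2x.png'
real = 'alice@example.com'

only_image = re.findall(pattern, image_like)          # lookahead rejects ".png"
only_real = re.findall(pattern, real)                 # normal address passes
mixed = re.findall(pattern, image_like + ' ' + real)  # only the real one survives

# Caveat: with two dots in the domain the engine can backtrack to a shorter match.
shortened = re.findall(pattern, 'a@b.c.png')

print(only_image)  # []
print(only_real)   # ['alice@example.com']
print(mixed)       # ['alice@example.com']
print(shortened)   # ['a@b.c']
```

The last case is worth knowing about: the lookahead blocks the match that would end in .png, but the regex is still free to stop one dot earlier, so a truncated string can slip through.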

Upvotes: 2

Joran Beasley

Reputation: 113988

You're already only acting on an if, so just make this part of the if check. That will be much easier than trying to exclude it from the regex:

if email not in self.emails and not email.endswith("png"):  # if not a duplicate
        self.csvwriter.writerow([page.title.encode('utf8'), page.url.encode("utf8"), email])
        self.emails.append(email)

Upvotes: 2
