Beautiful Soup Type Error and Regex

Question

I am trying to find all the emails on a given page and match them using a regex. I am using BeautifulSoup to get all the <a> tags:

email_re = re.compile('[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*')

email = soup.findAll("a")
for j in email:
    email = j.string
    for match in email_re.findall(email):
        outfile.write(match + "\n")
        print match

However, when I run my script, this part raises TypeError: expected string or buffer. I assume this is because email is a BeautifulSoup object and not a Python string. I have tried converting it to a string with either str() or __str__(), and both raise another error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 9: ordinal not in range(128). What can I do to get around these errors and actually get my script to run? I am out of ideas. Please help!

Aleksei Zyrianov · Accepted Answer

Most likely, the match variable is of type unicode. To write it to a file, it has to be encoded first. By default, Python 2 tries the ASCII codec, which fails on characters such as u'\u2019'. Try the following:

outfile.write(match.encode('utf-8') + "\n")

You may also want to replace UTF-8 with whichever encoding your outfile is supposed to use.
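
As a rough sketch of an alternative to calling .encode() on every line, the file can be opened with io.open, which takes an encoding and does the conversion on write. The helper names find_emails and write_lines are hypothetical and not part of the question's script. The sketch also skips None values, on the assumption that the original TypeError comes from j.string being None: BeautifulSoup returns None from .string when a tag has more than one child. It assumes the input strings are unicode (or that Python 3 is used).

```python
import io
import re

# Same pattern as in the question, written as a raw string.
email_re = re.compile(r'[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*')


def find_emails(texts):
    """Return all regex matches from an iterable of strings, skipping None."""
    found = []
    for text in texts:
        if text is None:
            # Assumption: this is the case that triggers
            # "TypeError: expected string or buffer" in the question.
            continue
        found.extend(email_re.findall(text))
    return found


def write_lines(path, lines, encoding='utf-8'):
    """Write each unicode string on its own line, encoded on the way out."""
    with io.open(path, 'w', encoding=encoding) as outfile:
        for line in lines:
            outfile.write(line + u'\n')
```

In the question's loop this would be used roughly as write_lines('emails.txt', find_emails(j.string for j in soup.findAll("a"))), where 'emails.txt' is a placeholder file name.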

There is also a nice Unicode HOWTO for Python 2.x. Note, however, that Python 3 takes a different, much more logical approach to Unicode.
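
To illustrate the Python 3 side, here is a minimal sketch, assuming Python 3, where str is always text and turning it into bytes is an explicit step. The helper try_encode is a hypothetical name used only for this illustration. It shows that the character from the error message cannot be represented in ASCII but encodes without trouble as UTF-8.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


sample = "it\u2019s"  # contains the right single quotation mark from the error

print(try_encode(sample, "ascii"))  # None
print(try_encode(sample, "utf-8"))  # b'it\xe2\x80\x99s'
```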
