Reputation: 1075
I am trying to find all the emails on a given page and match them using a regex. I am using BeautifulSoup to get all the anchor tags:
email_re = re.compile('[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*')
email = soup.findAll("a")
for j in email:
    email = j.string
    for match in email_re.findall(email):
        outfile.write(match + "\n")
        print match
However, when I run my script, this part of it raises TypeError: expected string or buffer. I assume this is because email is a BeautifulSoup object and not a Python string. I have tried to convert it to a string using either str() or __str__(), and both raise another error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 9: ordinal not in range(128). What can I do to get around these errors and actually have my script run? I am out of ideas. Please help!
Upvotes: 3
Views: 217
Reputation: 2342
Most likely, the match variable has type unicode. To write it to a file, it first has to be encoded using some encoding. By default, Python 2 tries to encode it as ASCII, which fails for characters such as u'\u2019'. Please try the following:
outfile.write(match.encode('utf-8') + "\n")
You may also want to replace UTF-8 with whatever encoding your outfile is supposed to have.
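As a rough sketch of the same idea, the file can also be opened with an explicit encoding so that each match is encoded on write instead of calling .encode() by hand. The function name write_emails, the sample strings and the temporary path below are made up for illustration. The None check is an assumption about the input: BeautifulSoup's .string can be None for a tag that does not contain a single string, and passing None to findall would fail.

```python
# -*- coding: utf-8 -*-
import io
import os
import re
import tempfile

# Equivalent to the pattern in the question, written as a raw string.
email_re = re.compile(r'[A-Za-z0-9.+_-]+@[A-Za-z0-9._-]+\.[a-zA-Z]*')


def write_emails(texts, path, encoding='utf-8'):
    """Write every email found in `texts` to `path`, one per line.

    Hypothetical helper, not part of the original script.
    """
    found = []
    # io.open encodes text as it is written, so no manual .encode() is needed.
    with io.open(path, 'w', encoding=encoding) as outfile:
        for text in texts:
            if text is None:  # stand-in for a tag whose .string is None
                continue
            for match in email_re.findall(text):
                outfile.write(match + u'\n')
                found.append(match)
    return found


# Made-up link texts, including the U+2019 character from the traceback.
sample = [u'Bob\u2019s address: bob@example.com', None, u'alice@example.org']
path = os.path.join(tempfile.mkdtemp(), 'emails.txt')
emails = write_emails(sample, path)
```

In the original loop, the strings would come from j.string for each tag returned by soup.findAll("a") rather than from a hard-coded list.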
There is also a nice Unicode HOWTO for Python 2.x. Note, however, that Python 3 takes a different, much more logical approach to dealing with Unicode.
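To illustrate what is different, here is a small Python 3 only sketch (the sample text is made up): str always holds Unicode text, bytes holds encoded data, and converting between the two is an explicit step rather than an implicit ASCII conversion.

```python
# Python 3 only: str is Unicode text, bytes is encoded data.
text = 'Bob\u2019s email'

# Encoding is always an explicit step with a named codec.
utf8_bytes = text.encode('utf-8')

# ASCII cannot represent U+2019, which is the failure from the question.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Mixing text and bytes raises TypeError instead of converting implicitly.
try:
    text + utf8_bytes
    mixed_failed = False
except TypeError:
    mixed_failed = True
```

In Python 3 the encoding is usually given once when the file is opened, for example open(path, 'w', encoding='utf-8'), and str values are then written directly.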
Upvotes: 3