Reputation: 45
I'm trying to read an HTML file, but when I pull out the titles and URLs to compare them against my keyword list 'alist',
I get this error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019'.
The error is shown at this link: http://tinypic.com/r/307w8bl/8
Code
for q in soup.find_all('a'):
    title = q.get('title')
    url = q.get('href')
    length = len(alist)
    i = 0
    while length > 0:
        if alist[i] in str(title):  # checks for keywords from html form from the titles and urls
            r.write(title)
            r.write("\n")
            r.write(url)
            r.write("\n")
        i = i + 1
        length = length - 1
doc.close()
r.close()
A little background: alist contains a list of keywords that I compare against each title to pick out the links I want. The strange thing is that if alist contains two or more words, the code runs perfectly, but if it contains only one word, the error above appears. Thanks in advance.
Upvotes: 3
Views: 8951
Reputation: 5193
If your list must be a list of byte strings, try encoding the title variable:
>>> alist = ['á']  # byte string (str)
>>> title = u'á'  # unicode string
>>> alist[0] in title
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> title and alist[0] in title.encode('utf-8')
True
>>>
Upvotes: 3
Reputation: 21243
The problem is in str(title). You are trying to convert Unicode data to a byte string.
Why are you converting title to a string? You can access it directly: q.get('title') already returns a Unicode string (or None if the tag has no title attribute).
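A minimal sketch of doing the comparison without str(), assuming the titles and hrefs have already been pulled out of the tags. The function name find_matches and the sample links list are hypothetical; the sample data stands in for the values of q.get('title') and q.get('href').

```python
def find_matches(links, keywords):
    """Return the (title, url) pairs whose title contains any keyword."""
    matches = []
    for title, url in links:
        # q.get('title') returns None when the attribute is missing.
        if title is None:
            continue
        # Compare Unicode with Unicode; no str() conversion is needed.
        if any(keyword in title for keyword in keywords):
            matches.append((title, url))
    return matches


# Hypothetical sample data in place of values taken from soup.find_all('a').
links = [
    (u"Python\u2019s Unicode HOWTO", u"http://example.com/unicode"),
    (None, u"http://example.com/untitled"),
    (u"Something else", u"http://example.com/other"),
]
matches = find_matches(links, [u"Unicode"])
```

For writing the results out, opening the file with io.open(path, 'w', encoding='utf-8') lets r.write(title) accept Unicode text directly. In Python 2 the newline literal then has to be u"\n" as well.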
Upvotes: 0
Reputation: 31250
Presumably, title is a Unicode string that can contain any kind of character; str(title) tries to turn it into a bytestring using the ASCII codec, but that fails because your title contains a non-ASCII character.
What are you trying to do? Why do you need to turn the title into a bytestring?
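The failure described above can be reproduced in isolation. This is only a sketch: the sample title is made up, and it calls encode("ascii") explicitly to mimic what str(title) does implicitly in Python 2.

```python
# Made-up title containing U+2019, a right single quotation mark.
title = u"It\u2019s a title"

# Encoding with the ASCII codec fails on the non-ASCII character.
try:
    title.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly with UTF-8 succeeds and yields a bytestring.
encoded = title.encode("utf-8")
```

If a bytestring really is needed, naming the codec explicitly as in the last line avoids the implicit ASCII conversion.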
Upvotes: 0