Reputation: 1486
I'm not familiar with how BeautifulSoup handles encodings.
Some pages I work with have attribute values in Chinese, and I want to use such a value to select tags.
For example, given HTML like this:
<P class=img_s>
<A href="/pic/93/b67793.jpg" target="_blank" title="查看大图">
<IMG src="/pic/93/s67793.jpg">
</A>
</P>
I want to extract '/pic/93/b67793.jpg', so I tried:
img_urls = form_soup.findAll('a',title='查看大图')
and got:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb2 in position 0: ordinal not in range(128)
I have tried two workarounds, and both failed. The first:
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
The second:
response = unicode(response, 'gb2312','ignore').encode('utf-8','ignore')
Upvotes: 1
Views: 563
Reputation: 4164
Beautiful Soup 4.1.0 will automatically convert attribute values from UTF-8, which solves this problem.
Upvotes: 1
Reputation: 1122232
You need to pass in unicode to the findAll method:
# -*- coding: utf-8 -*-
...
img_urls = form_soup.findAll('a', title=u'查看大图')
Note the u unicode literal marker in front of the title value. You do need to specify an encoding on your source file for this to work (the coding comment at the top of the file), or switch to unicode escape codes instead:
img_urls = form_soup.findAll('a', title=u'\u67e5\u770b\u5927\u56fe')
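If you go the escape-code route, the codes can be generated rather than typed by hand. A small sketch, assuming this snippet itself is saved as UTF-8; unicode_escape is a standard Python codec, and the variable names are only illustrative:

```python
# -*- coding: utf-8 -*-
# Assumes this file is saved as UTF-8, matching the coding comment above.
title = u'查看大图'

# The unicode_escape codec yields ASCII bytes holding the \uXXXX form;
# decoding them as ASCII gives text you can paste into a u'...' literal.
escaped = title.encode('unicode_escape').decode('ascii')
print(escaped)
```

The printed text can then be pasted between the quotes of a u'...' literal, which keeps the script pure ASCII.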
Internally, BeautifulSoup uses unicode, but you are passing it a byte-string containing non-ASCII characters. BeautifulSoup tries to decode that to unicode for you and fails, as it doesn't know what encoding you used. By providing it with ready-made unicode instead you side-step the issue.
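To see the failure in isolation, here is a minimal sketch that uses only the standard library. It assumes the script was saved as GB2312 (a guess based on the byte 0xb2 in the traceback, which is the first GB2312 byte of 查). The helper try_ascii is a hypothetical name, and the explicit ASCII decode stands in for the implicit conversion Python 2 performs when a byte string meets unicode:

```python
# Sketch only: reproduces the failing ASCII decode without BeautifulSoup.
title = u'\u67e5\u770b\u5927\u56fe'  # the title text as unicode

# Assumption: the script was saved as GB2312, so a plain '...' literal
# would hold these bytes rather than unicode text.
raw = title.encode('gb2312')


def try_ascii(data):
    """Return the error message from decoding data as ASCII, or None."""
    try:
        data.decode('ascii')
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


error = try_ascii(raw)          # fails on the very first byte
decoded = raw.decode('gb2312')  # the right codec recovers the text

print(error)
print(decoded == title)
```

Decoding with the correct codec up front, or writing a unicode literal in the first place, means no guess at the encoding is ever needed.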
Working example:
>>> from BeautifulSoup import BeautifulSoup
>>> example = u'<P class=img_s>\n<A href="/pic/93/b67793.jpg" target="_blank" title="\u67e5\u770b\u5927\u56fe"><IMG src="/pic/93/s67793.jpg"></A></P>'
>>> soup = BeautifulSoup(example)
>>> soup.findAll('a', title=u'\u67e5\u770b\u5927\u56fe')
[<a href="/pic/93/b67793.jpg" target="_blank" title="查看大图"><img src="/pic/93/s67793.jpg" /></a>]
Upvotes: 6