kuafu

Reputation: 1486

How to get tags when an attribute value is Chinese in BeautifulSoup

I'm not familiar with how BeautifulSoup handles encodings.

On some of the pages I process, some attribute values are in Chinese, and I want to use such a Chinese attribute value to extract tags.

For example, given HTML like this:

<P class=img_s>
<A href="/pic/93/b67793.jpg" target="_blank" title="查看大图">
<IMG src="/pic/93/s67793.jpg">
</A>
</P>

I want to extract '/pic/93/b67793.jpg', so what I did is:

img_urls = form_soup.findAll('a',title='查看大图')

and I get:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb2 in position 0: ordinal not in range(128)

To work around this I tried two methods, and both failed. The first is:

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

The second is:

response = unicode(response, 'gb2312','ignore').encode('utf-8','ignore') 

Upvotes: 1

Views: 563

Answers (2)

Leonard Richardson

Reputation: 4164

Beautiful Soup 4.1.0 will automatically convert attribute values from UTF-8, which solves this problem.

Upvotes: 1

Martijn Pieters

Reputation: 1122232

You need to pass in unicode to the findAll method:

# -*- coding: utf-8 -*-
... 
img_urls = form_soup.findAll('a', title=u'查看大图')

Note the u unicode literal marker in front of the title value. You do need to specify an encoding on your source file for this to work (the coding comment at the top of the file), or switch to unicode escape codes instead:

img_urls = form_soup.findAll('a', title=u'\u67e5\u770b\u5927\u56fe')
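If typing the escape codes by hand is awkward, they can be generated from the text itself. The following is only a minimal sketch using the standard library's unicode_escape codec; the variable names are illustrative, and the snippet is written so that it should run on Python 2 or Python 3:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# The title text as a unicode literal (needs the coding comment above
# when this file is saved as UTF-8 and run on Python 2).
title = u'查看大图'

# unicode_escape turns each non-ASCII character into its \uXXXX form,
# producing pure-ASCII text that can be pasted into a u'...' literal.
escaped = title.encode('unicode_escape').decode('ascii')
print(escaped)

# Decoding the escaped form gives back the original text.
round_trip = escaped.encode('ascii').decode('unicode_escape')
print(round_trip == title)
```

The printed escape sequence can be pasted between u'...' quotes, which keeps the source file pure ASCII and so removes the need for a coding comment.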

Internally, BeautifulSoup uses unicode, but you are passing it a byte string that contains non-ASCII characters. BeautifulSoup tries to decode it to unicode for you and fails, because it doesn't know which encoding you used. By providing ready-made unicode instead, you sidestep the issue.
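The failure can be reproduced without BeautifulSoup at all. The sketch below assumes the source file was saved as GB2312 (a guess based on the 'gb2312' in the question and the byte 0xb2 in the error message, not something the question confirms); the variable names are illustrative, and the snippet uses only the standard library:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# The title text as unicode, written with escapes to keep this file ASCII.
title = u'\u67e5\u770b\u5927\u56fe'

# Assumption: the source file was saved as GB2312, so a plain '...'
# byte string literal would have held these bytes.
raw = title.encode('gb2312')
first_byte = bytearray(raw)[0]
print(hex(first_byte))

# Decoding without knowing the real encoding (here: ASCII) fails on the
# very first byte, because it is outside the 0-127 range.
try:
    raw.decode('ascii')
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start
    print(exc)

# Decoding with the matching codec recovers the original text.
recovered = raw.decode('gb2312')
print(recovered == title)
```

The point of the sketch is that the bytes themselves carry no encoding information: only the caller knows which codec to use, which is why handing over unicode is the more reliable option.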

Working example:

>>> from BeautifulSoup import BeautifulSoup
>>> example = u'<P class=img_s>\n<A href="/pic/93/b67793.jpg" target="_blank" title="\u67e5\u770b\u5927\u56fe"><IMG src="/pic/93/s67793.jpg"></A></P>'
>>> soup = BeautifulSoup(example)
>>> soup.findAll('a', title=u'\u67e5\u770b\u5927\u56fe')
[<a href="/pic/93/b67793.jpg" target="_blank" title="查看大图"><img src="/pic/93/s67793.jpg" /></a>]

Upvotes: 6
