How to get tags when an attribute value is Chinese in BeautifulSoup

Question

I'm not familiar with how BeautifulSoup handles encodings.

On some of the pages I process, some attribute values are in Chinese, and I want to use such a Chinese attribute value to extract tags.

For example, given HTML like the below:

I want to extract '/pic/93/b67793.jpg', so what I did was:

img_urls = form_soup.findAll('a',title='查看大图')

and I got:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb2 in position 0: ordinal not in range(128)

To deal with this, I tried two approaches, and both failed. One was:

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

The other was:

response = unicode(response, 'gb2312','ignore').encode('utf-8','ignore')

Martijn Pieters · Accepted Answer

You need to pass a unicode value to the findAll method:

# -*- coding: utf-8 -*-
... 
img_urls = form_soup.findAll('a', title=u'查看大图')

Note the u unicode literal marker in front of the title value. You do need to specify an encoding on your source file for this to work (the coding comment at the top of the file), or switch to unicode escape codes instead:

img_urls = form_soup.findAll('a', title=u'\u67e5\u770b\u5927\u56fe')
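If you go with the escape form, the escapes can be generated rather than typed by hand. This is only a minimal sketch, assuming Python 3 and a UTF-8 source file; the helper name to_unicode_escapes is hypothetical and not part of BeautifulSoup:

```python
def to_unicode_escapes(text):
    # The unicode_escape codec rewrites non-ASCII characters as Python
    # escape sequences and returns ASCII bytes; decode them back to str.
    return text.encode('unicode_escape').decode('ascii')

escaped = to_unicode_escapes('查看大图')
print(escaped)  # prints \u67e5\u770b\u5927\u56fe

# The escaped literal and the plain literal are the same string.
same = (u'\u67e5\u770b\u5927\u56fe' == '查看大图')
```

The printed text can be pasted into a u'...' literal, which keeps the source file pure ASCII so no coding comment is needed.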

Internally, BeautifulSoup uses unicode, but you are passing it a byte string containing non-ASCII characters. BeautifulSoup tries to decode that to unicode for you and fails because it doesn't know which encoding you used. By providing ready-made unicode instead, you sidestep the issue.
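The failure can be reproduced without BeautifulSoup at all. The sketch below uses only the standard library and runs under Python 3; it assumes the title bytes were GB2312-encoded, which is a guess based on the 0xb2 byte in the traceback and the 'gb2312' in the question, not something the question confirms:

```python
# The title as unicode, written with escapes so the file stays ASCII.
title_text = u'\u67e5\u770b\u5927\u56fe'

# Assumption: this is what the byte string looked like in a GB2312 file.
raw = title_text.encode('gb2312')

# Decoding those bytes with the ASCII codec fails on the very first byte.
try:
    raw.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding with the codec that was actually used recovers the text.
decoded = raw.decode('gb2312')

print(ascii_error)
print(decoded == title_text)
```

Under this assumption the ASCII decode fails at position 0, matching the traceback in the question, while decoding with the right codec gives back exactly the unicode value that findAll needs.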

Working example:

>>> from BeautifulSoup import BeautifulSoup
>>> example = u'
'
>>> soup = BeautifulSoup(example)
>>> soup.findAll('a', title=u'\u67e5\u770b\u5927\u56fe')
[]
