Umair Ayub

Reputation: 21261

Python Scrapy not returning Chinese characters

I am scraping this link using Python Scrapy. Every file starts with # -*- coding: utf-8 -*-.

To extract the product title, I use this code:

response.css("h1.d-title::text").extract_first()

This shows

2017\xe6\x98\xa5\xe5\xa4\x8f\xe6\x96\xb0\xe6\xac\xbe\xe5\xa5\xb3\xe5\xa3\xab\xe8\xbf\x90\xe5\x8a\xa8\xe9\x9e\x8b\xe9\x9f\xa9\xe7\x89\x88\xe4\xbc\x91\xe9\x97\xb2\xe7\xbd\x91\xe5\x8d\x95\xe9\x9e\x8bsport shoes men\xe5\xa4\x96\xe8\xb4\xb8\xe6\x89\xb9\xe5\x8f\x91

And if I do

response.css("h1.d-title::text").extract_first().decode('gbk').encode('utf-8')

It gives me this error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-19: ordinal not in range(128)

I have tried other alternatives I found online, but none of them worked.

However, if I do this in the Python terminal (without Scrapy), it prints the Chinese perfectly:

>>> s = "2017\xe6\x98\xa5\xe5\xa4\x8f\xe6\x96\xb0\xe6\xac\xbe\xe5\xa5\xb3\xe5\xa3\xab\xe8\xbf\x90\xe5\x8a\xa8\xe9\x9e\x8b\xe9\x9f\xa9\xe7\x89\x88\xe4\xbc\x91\xe9\x97\xb2\xe7\xbd\x91\xe5\x8d\x95\xe9\x9e\x8bsport shoes men\xe5\xa4\x96\xe8\xb4\xb8\xe6\x89\xb9\xe5\x8f\x91"
>>> print s
2017春夏新款女士运动鞋韩版休闲网单鞋sport shoes men外贸批发

Why does print give the correct output?

Upvotes: 1

Views: 1174

Answers (2)

Tiny.D

Reputation: 6556

Scrapy selectors return unicode strings (extract() returns a list of them, extract_first() a single one); see "Using selectors with regular expressions" in the Scrapy documentation. All you need to do is encode the unicode string to UTF-8. There is no need to decode it as GBK and then encode it back to UTF-8.

title = response.css("h1.d-title::text").extract_first().encode('utf-8')

As for print in the Python terminal, I think the default encoding of your environment is UTF-8. You can check this in your Python terminal:

>>> import sys
>>> print sys.stdout.encoding
UTF-8

When you print a unicode string, it is converted to that encoding (UTF-8 here) and then printed.
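As a rough illustration of that conversion, here is a minimal sketch written for Python 3, where str plays the role of Python 2's unicode. The shortened sample title and the UTF-8 fallback are assumptions made for the example; they are not something Scrapy provides.

```python
import sys

# Shortened sample standing in for the scraped title (assumption for illustration).
title = u"2017\u6625\u590f\u65b0\u6b3e"

# The encoding print uses for the terminal. Fall back to UTF-8 when stdout
# does not report one (an assumption, e.g. for replaced stdout objects).
stdout_encoding = getattr(sys.stdout, "encoding", None) or "utf-8"

# Roughly what print does with text: encode it with the stdout encoding.
# errors="replace" keeps the sketch from failing on non-UTF-8 terminals.
encoded = title.encode(stdout_encoding, errors="replace")

print(stdout_encoding)
print(len(title), len(encoded))
```

On a UTF-8 terminal the encoded length is larger than the text length, because each Chinese character takes three bytes in UTF-8.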

Upvotes: 1

Done Data Solutions

Reputation: 2286

Based on your example code with print s, I assume you are using Python 2.7.

When I ran

response.css("h1.d-title::text").extract_first()

on the site you listed I got this as a result:

u'2017\u6625\u590f\u65b0\u6b3e\u5973\u58eb\u8fd0\u52a8\u978b\u97e9\u7248\u4f11\u95f2\u7f51\u5355\u978bsport shoes men\u5916\u8d38\u6279\u53d1'

This means Scrapy is already converting the result into a unicode object (which is what I would have expected).

Running decode('gbk') on it then fails, because decode expects a GBK-encoded byte string. Python 2 first tries to turn the unicode object into bytes with the default ASCII codec, which is why you get a UnicodeEncodeError from the 'ascii' codec.

So if you need to convert it to UTF-8 (instead of just using the unicode object, which I would prefer), you should do this:
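To make the source of the 'ascii' error concrete: in Python 2, calling decode on a unicode object implicitly encodes it with the default ASCII codec first. The sketch below is written for Python 3, where str has no decode method, so that hidden step is spelled out by hand. It is an approximation of the Python 2 behaviour, not the exact code path.

```python
# The unicode text Scrapy returned (copied from the result shown above).
title = (
    u"2017\u6625\u590f\u65b0\u6b3e\u5973\u58eb\u8fd0\u52a8\u978b"
    u"\u97e9\u7248\u4f11\u95f2\u7f51\u5355\u978bsport shoes men"
    u"\u5916\u8d38\u6279\u53d1"
)

# Python 2's unicode.decode('gbk') behaves roughly like this two-step call:
# an implicit ASCII encode followed by the requested GBK decode.
try:
    title.encode("ascii").decode("gbk")
    error = None
except UnicodeEncodeError as exc:
    # The ASCII encode fails before GBK is ever involved.
    error = exc

# error.end is exclusive, so the last failing position is error.end - 1.
print(type(error).__name__, error.encoding, error.start, error.end - 1)
```

The reported range covers the first run of Chinese characters right after "2017", which lines up with the positions in the traceback from the question.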

response.css("h1.d-title::text").extract_first().encode('utf-8')

Result:

'2017\xe6\x98\xa5\xe5\xa4\x8f\xe6\x96\xb0\xe6\xac\xbe\xe5\xa5\xb3\xe5\xa3\xab\xe8\xbf\x90\xe5\x8a\xa8\xe9\x9e\x8b\xe9\x9f\xa9\xe7\x89\x88\xe4\xbc\x91\xe9\x97\xb2\xe7\xbd\x91\xe5\x8d\x95\xe9\x9e\x8bsport shoes men\xe5\xa4\x96\xe8\xb4\xb8\xe6\x89\xb9\xe5\x8f\x91'

This prints the same string you expect.

Besides this, it is usually a good idea to use Python 3, as it handles most such situations out of the box.
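The escaped form and the printed form are the same data seen two ways: the interactive shell echoes the repr of a byte string, while printing shows the decoded characters. Here is a small sketch of that, written for Python 3 with a shortened sample title (an assumption for illustration).

```python
# Shortened sample standing in for the scraped title (assumption for illustration).
title = u"2017\u6625\u590f\u65b0\u6b3e"

# Encode the text to UTF-8 bytes, as in the answer above.
utf8_bytes = title.encode("utf-8")

# repr() is what the interactive shell echoes: non-ASCII bytes show up
# as \x escapes, like the output in the question.
escaped = repr(utf8_bytes)

# Decoding the bytes as UTF-8 gives back the original text unchanged.
round_trip = utf8_bytes.decode("utf-8")

print(escaped)
print(round_trip == title)
```

So the \xe6\x98\xa5... output is not corrupted data, only the escaped display of correctly encoded UTF-8 bytes.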

Upvotes: 0
