Reputation: 465

Scrapy retrieves text encoding incorrectly, Hebrew as \u0d5 etc

This is my first time working with this stuff. I have checked the other SO questions about internationalization and text encoding.

I'm doing the Scrapy tutorial and got stuck at the "Extracting Data" part: when I extract the data, the text is displayed as a series of \uXXXX escapes instead of Hebrew.

You can reproduce it by scraping this page, for example:

scrapy shell http://israblog.nana10.co.il/blogread.asp?blog=167524&blogcode=13348970
hxs.select('//h2[@class="title"]/text()').extract()[0]

This returns:

u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9?'

(Unrelated:) if you try to print it in the console, you get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-1: character maps to <undefined>

I tried setting the encoding through the settings and converting manually; basically, I feel like I have tried everything.

(I've already spent about 5 pomodoros trying to fix this!)

What can I do to get the Hebrew text that should be there: "מי אנס פוטנציאלי?"

(Disclaimer: I just went into the first blog and post I noticed on http://Israblog.co.il, I'm in no way related to the blog or blog owner, I just used it as an example)

Upvotes: 1

Answers (2)

warvariuc

Reputation: 59644

what can I do to get the hebrew text that should be there: "מי אנס פוטנציאלי?"

test.py:

# coding: utf-8

a = u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9?'
b = 'מי אנס פוטנציאלי?'

print a
print b

Result:

vic@wic:~/projects/snippets$ python test.py 
מי אנס פוטנציאלי?
מי אנס פוטנציאלי?
vic@wic:~/projects/snippets$

As you can see, they print the same text. The \uXXXX form is just a different representation (the repr) of the same Unicode string, so don't worry: the page was scraped correctly.
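A quick way to convince yourself of this is to compare the escaped literal with the Hebrew text directly. The sketch below is only illustrative: the variable names are made up here, and it deliberately avoids printing so it does not depend on the console encoding.

```python
# -*- coding: utf-8 -*-
# Illustrative names; not part of the original test.py.
escaped = u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9?'
hebrew = u'מי אנס פוטנציאלי?'

# Both literals describe the same sequence of code points.
same_text = (escaped == hebrew)

# Each \uXXXX escape is one character, not six.
char_count = len(escaped)

# Encoding to UTF-8 gives the bytes you would store or send.
utf8_bytes = escaped.encode('utf-8')
```

Here same_text is True, which is the point: nothing was lost during scraping, only the way the interpreter displays the value differs.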

If you want to save it to a file, encode it first (for example as UTF-8):

Python 2.7.3 (default, Apr 20 2012, 22:39:59) 
[GCC 4.6.3] on linux2
>>> a = u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9'
>>> a
u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9'
>>> f = open('test.txt', 'w')
>>> f.write(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
>>> f.write(a.encode('utf-8'))
>>> f.close()
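An alternative to calling .encode() by hand is to let the file object do the encoding. This is a minimal sketch, assuming io.open from the standard library (available in Python 2.6+ and identical to open in Python 3); the temporary path and variable names are placeholders chosen for the example.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

title = u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9?'

# Placeholder location; use your own output path instead.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')

# io.open encodes unicode text to UTF-8 on write...
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(title)

# ...and decodes it back to unicode on read.
with io.open(path, 'r', encoding='utf-8') as f:
    round_trip = f.read()

# The bytes on disk are the same as the manual .encode('utf-8') result.
with io.open(path, 'rb') as f:
    raw_bytes = f.read()
```

Opening the resulting file in an editor that understands UTF-8 should show the Hebrew text rather than escapes.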

Upvotes: 2

iblazevic

Reputation: 2733

Have you tried checking what you get when you store the scraped data somewhere, for example as JSON or XML?
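As a rough illustration of the JSON case, using only the standard json module: by default the output also contains \uXXXX escapes, but they decode back to the original text, and ensure_ascii=False keeps the Hebrew characters as they are. The item dictionary is a made-up stand-in for scraped data.

```python
# -*- coding: utf-8 -*-
import json

# Hypothetical scraped item, used only for this example.
item = {u'title': u'\u05de\u05d9 \u05d0\u05e0\u05e1 \u05e4\u05d5\u05d8\u05e0\u05e6\u05d9\u05d0\u05dc\u05d9?'}

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped_json = json.dumps(item)

# ensure_ascii=False: the Hebrew characters are kept unescaped.
readable_json = json.dumps(item, ensure_ascii=False)

# Either form loads back to exactly the same data.
restored = json.loads(escaped_json)
restored_readable = json.loads(readable_json)
```

In both cases the stored data is intact; the escapes in the default output are only a serialization detail.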

I had these problems with some characters on a few sites. In most cases, if you don't do anything with the retrieved data, it gets stored properly. If you try to print it in the console, though, you won't get the proper result, or you will get an error unless you use repr:

print repr(data)
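If repr is too noisy, another option is to escape only the characters the console cannot encode. This is a sketch under assumptions: console_safe is a hypothetical helper written for this example, and cp437 is passed explicitly only because that is the codec shown in the question's traceback.

```python
import sys

def console_safe(text, encoding=None):
    # Fall back to the console's own encoding, or ASCII if it is unknown.
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    # 'backslashreplace' turns unencodable characters into \uXXXX escapes
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, 'backslashreplace').decode(encoding)

# cp437 has no Hebrew letters, so they come out escaped; '?' is unchanged.
shown = console_safe(u'\u05de\u05d9?', 'cp437')
```

Printing shown on a cp437 console should not raise, since every character in it is plain ASCII. On a console whose encoding supports Hebrew, the text passes through unchanged.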

I hope this helps, because I know the frustration of encoding problems.

Upvotes: 0
