Reputation: 1671
For the last two weeks I have been struggling to handle some data that I scraped with Scrapy. I am using Python 2.7 on Windows 7. This is a small snippet of the data, scraped and extracted through a Scrapy XPath selector:
{'city': [u'Mangenberger Str.\xa0162', u'42655\xa0Solingen']}
These data were scraped from a UTF-8 encoded page, or at least that is what it claims:
Content-Type: text/html;charset=utf-8
So I believe that I need to decode them in order to get:
Mangenberger Str. 16242655 Solingen
This is what I am getting in my console:
>>> s='Mangenberger Str.\xc2\xa016242655\xc2\xa0Solingen'
>>> s1=s.decode('utf-8')
>>> print s1
Mangenberger Str. 16242655 Solingen
Perfect! But this is far from what I get when I run my script. I tried both encoding and decoding:
utf-8 encoding
{'city': 'Mangenberger Str.\xc2\xa016242655\xc2\xa0Solingen'}
exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 17:
utf-8-sig encoding
{'city': '\xef\xbb\xbfMangenberger Str.\xc2\xa016242655\xc2\xa0Solingen'}
exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0:
utf-8 decoding
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 17:
utf-8-sig decoding
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 17:
Encode code:
item['city']= "".join(element.select('//div[@id="bubble_2"]/div/text()').extract()).encode('utf-8')
Decode code:
item['city']= "".join(element.select('//div[@id="bubble_2"]/div/text()').extract()).decode('utf-8')
From what I understand, the BOM is the problem when I try to decode this string? But then why does it work without problems in my console, yet raise an error once I run Scrapy?
Upvotes: 1
Views: 2683
Reputation: 20748
\xa0 in that Python unicode string is the non-breaking space character. u'Mangenberger Str.\xa0162' and u'42655\xa0Solingen' are perfectly valid unicode strings. Python works with unicode strings wonderfully.
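As a quick check of that claim, here is a minimal sketch that looks up the character's name and, assuming an ordinary space is what you actually want in the output, replaces it while staying in unicode the whole time. The helper name replace_nbsp is my own, not something from Scrapy.

```python
import unicodedata

# The two strings from the question, exactly as extract() returned them.
parts = [u'Mangenberger Str.\xa0162', u'42655\xa0Solingen']

# Ask Python what U+00A0 actually is.
nbsp_name = unicodedata.name(u'\xa0')


def replace_nbsp(text):
    """Swap non-breaking spaces for plain spaces (hypothetical helper)."""
    return text.replace(u'\xa0', u' ')


# No encode() or decode() is involved: unicode in, unicode out.
cleaned = [replace_nbsp(part) for part in parts]
```

Whether to replace the character at all is a judgment call; it is legitimate text, and leaving it alone is equally valid.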
Scrapy XPath selector extract() calls give you a list of unicode strings, and dealing with unicode all along is usually the way to go.
I would NOT recommend encoding the unicode strings to something else in your Scrapy code. (Note that encoding is the operation that applies here: decoding is for converting non-unicode byte strings into unicode strings.)
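To illustrate the direction of each operation, here is a minimal sketch using the values from the question (the variable names are my own). It also hints at where the errors likely come from, though I have not verified this against your exact script: in Python 2, calling .decode() on a value that is already unicode first encodes it implicitly with the ascii codec, which produces the kind of UnicodeEncodeError you quoted.

```python
# What extract() hands back: a list of unicode strings.
parts = [u'Mangenberger Str.\xa0162', u'42655\xa0Solingen']

# Joining unicode strings gives a unicode string; nothing to decode here.
text = u''.join(parts)

# encode(): unicode text -> bytes. U+00A0 becomes the two bytes C2 A0.
data = text.encode('utf-8')

# decode(): bytes -> unicode text. This is the only direction in which
# decode() is meaningful, and it restores the original string.
roundtrip = data.decode('utf-8')

# Calling text.decode('utf-8') on the unicode value instead is the
# mistake: Python 2 tries an implicit ascii encode first and fails on
# u'\xa0' (Python 3 has no such method on text at all).
```

The byte string in `data` is the same one you typed into your console, which is why decoding worked there: in the console you started from bytes, in the spider you start from unicode.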
The only step where it makes sense to encode the strings is at the end, when exporting the data (CSV, XML), and even that is already handled.
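If you ever do need to write the data out yourself, a rough sketch of "encode only at the boundary" could look like the following. It uses io.open from the standard library, which takes an encoding argument on both Python 2.7 and 3; the file name, the temporary directory, and the write_city helper are assumptions made for the example, and this is not how Scrapy's own exporters are implemented.

```python
import io
import os
import tempfile

# The item keeps unicode internally, as it came out of the selector.
item = {'city': u''.join([u'Mangenberger Str.\xa0162', u'42655\xa0Solingen'])}


def write_city(path, item):
    """Write the city field as UTF-8 (hypothetical helper)."""
    # io.open encodes on write, so no manual .encode() call is needed.
    with io.open(path, 'w', encoding='utf-8') as handle:
        handle.write(item['city'])


def read_bytes(path):
    """Read the file back as raw bytes to inspect what was stored."""
    with io.open(path, 'rb') as handle:
        return handle.read()


# Hypothetical output location, used only for this demonstration.
out_path = os.path.join(tempfile.mkdtemp(), 'city.txt')
write_city(out_path, item)
raw = read_bytes(out_path)
```

The point of the design is that the item itself never holds bytes, so nothing downstream can trip over an implicit ascii conversion.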
Maybe you can explain what is causing you trouble with these unicode strings.
Upvotes: 2