unicode issue with scraped data via scrapy

Question

I have been struggling for the last two weeks to handle some data that I scraped with scrapy. I am using Python 2.7 on Windows 7. This is a small snippet of the data, scraped and extracted through a scrapy XPath selector:

{'city': [u'Mangenberger Str.\xa0162', u'42655\xa0Solingen']}
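For reference, the `u'...'` prefix means these values are already unicode strings, and `\xa0` is the code point U+00A0 rather than a raw byte. A minimal sketch that inspects the values above (the variable names `parts`, `joined`, `char_name` and `encoded` are only for illustration and are not part of the spider):

```python
# -*- coding: utf-8 -*-
import unicodedata

# The values exactly as shown in the scraped item (unicode strings).
parts = [u'Mangenberger Str.\xa0162', u'42655\xa0Solingen']

# Joining unicode strings gives another unicode string.
joined = u"".join(parts)

# Index 17 is the first character after "Mangenberger Str."
char_name = unicodedata.name(joined[17])

# Encoding the unicode string to UTF-8 turns U+00A0 into the bytes C2 A0.
encoded = joined.encode('utf-8')

print(char_name)
print(repr(encoded))
```

The encoded result is the same byte sequence as the one typed into the console further down, so the console test starts from bytes while the selector returns unicode.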

The data come from a page that is UTF-8 encoded, or at least that is what its header says:

Content-Type: text/html;charset=utf-8

So I believe that I need to decode them in order to get:

Mangenberger Str. 16242655 Solingen

This is what I am getting in my console:

>>> s='Mangenberger Str.\xc2\xa016242655\xc2\xa0Solingen'
>>> s1=s.decode('utf-8')
>>> print s1
Mangenberger Str. 16242655 Solingen

Perfect! But this is far from what I get when I run my script. I tried both encoding and decoding:

utf-8 encoding
{'city': 'Mangenberger Str.\xc2\xa016242655\xc2\xa0Solingen'}
exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 17:

utf-8-sig encoding
{'city': '\xef\xbb\xbfMangenberger Str.\xc2\xa016242655\xc2\xa0Solingen'}
exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0:

utf-8 decoding
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 17:

utf-8-sig decoding
exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 17:

Encode code:

item['city'] = "".join(element.select('//div[@id="bubble_2"]/div/text()').extract()).encode('utf-8')

Decode code:

item['city'] = "".join(element.select('//div[@id="bubble_2"]/div/text()').extract()).decode('utf-8')

From what I understand, the BOM is the problem when I try to decode this string? But then why does it work without problems in my console and fail with an error once I run scrapy?
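One difference that may matter here: the console test starts from a byte string, while the selector output shown earlier is already unicode. In Python 2, calling `.decode()` on a unicode value first encodes it implicitly with the ASCII codec, which is consistent with the `UnicodeEncodeError` in the decode attempts. A minimal sketch, assuming the goal is to end up with a unicode value either way; `to_text` is a hypothetical helper, not part of Scrapy:

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding='utf-8'):
    """Return unicode text; decode only when value is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Already unicode: decoding again would trigger an implicit ASCII encode
    # in Python 2, so the value is returned unchanged.
    return value

# Unicode, as returned by the selector and joined.
scraped = u"".join([u'Mangenberger Str.\xa0162', u'42655\xa0Solingen'])

# Bytes, as typed into the console test.
console = b'Mangenberger Str.\xc2\xa016242655\xc2\xa0Solingen'

from_scraped = to_text(scraped)
from_console = to_text(console)

# Both paths should yield the same unicode string.
same = from_scraped == from_console
print(same)
```

The type check is the only design choice here: it avoids guessing whether a value has already been decoded.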
