Reputation: 1982
For some reason, when I grab a text value from an element using Scrapy, it displays correctly when printed, but when I put it in an array, it gets encoded improperly.
Here is the test: I used the word Château. In the first case, test, Scrapy gets the word, prints it, and adds it to an array. In the second case, test2, I literally copy-pasted the word that was printed by another test into the array.
Here is my Scrapy Python script:
value=node.xpath('//AddrDisplayMemberSerialization/text()').extract_first()
print value;
array={'test':value,'test2':'Château'}
print array
The array seems to encode the values automatically. Does Python do this, or does Scrapy?
And why do they get encoded differently?
Upvotes: 0
Views: 247
Reputation: 146580
The issue happens because of a difference between Python 2 and Python 3. If you do this in Python 3, it works straight away:
Python 3.6.2 (default, Jul 17 2017, 16:44:45)
[GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> value = 'Château'
>>> print (value)
Château
>>> array={'test':value,'test2':'Château'}
>>> print(array)
{'test': 'Château', 'test2': 'Château'}
>>>
Now let's go back to Python 2:
Python 2.7.13 (default, Jul 18 2017, 09:17:00)
[GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> value = 'Château'
>>> print value;
Château
>>> array={'test':value,'test2':'Château'}
>>> print array
{'test': 'Ch\xc3\xa2teau', 'test2': 'Ch\xc3\xa2teau'}
This happens because printing a dict in Python 2 uses the repr() of each value, and the repr() of a byte string escapes every non-ASCII byte. So you see the escaped string representation, not the Unicode text:
>>> str(array)
"{'test': 'Ch\\xc3\\xa2teau', 'test2': 'Ch\\xc3\\xa2teau'}"
>>> print str(array)
{'test': 'Ch\xc3\xa2teau', 'test2': 'Ch\xc3\xa2teau'}
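The same effect can be reproduced in Python 3 if a UTF-8 bytes object stands in for the Python 2 str. This is only a sketch under that assumption (the variable names are illustrative, not from the question): printing the text shows the character, while printing the container shows the repr() of the bytes.

```python
# Python 3 sketch: a UTF-8 bytes object plays the role of a Python 2 `str`.
value = 'Château'
utf8_bytes = value.encode('utf-8')

# Printing the text itself shows the accented character.
print(value)

# A container is rendered using repr() of each element,
# so every non-ASCII byte is shown as a \xNN escape.
array = {'test': utf8_bytes}
container_text = str(array)
print(container_text)

# The escaped form is exactly what repr() returns for the element.
element_repr = repr(utf8_bytes)
print(element_repr)
```

Nothing is re-encoded here: the stored bytes are unchanged, only their display inside the container differs.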
What you want to do when printing is decode the escape sequences with unicode-escape:
>>> print str(array).decode("unicode-escape")
{'test': 'Château', 'test2': 'Château'}
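But wait, this messes up the output? That is because unicode-escape turns each \xNN into the single character with that code point, which is what Latin-1 does, so the two UTF-8 bytes of â become two separate characters. Encoding back to Latin-1 restores the original bytes: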
>>> print str(array).decode("unicode-escape").encode("latin-1")
{'test': 'Château', 'test2': 'Château'}
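The two steps above can be traced one at a time. The following is a Python 3 sketch of the same round trip, not the Python 2 session itself; the step names are illustrative, and the final explicit UTF-8 decode is needed in Python 3 because print no longer hands raw bytes to the terminal.

```python
# The escaped text as it appears inside the printed dict (literal backslashes).
escaped = b'Ch\\xc3\\xa2teau'

# Step 1: unicode_escape maps each \xNN to the code point U+00NN,
# so the two UTF-8 bytes of "â" become two separate characters.
step1 = escaped.decode('unicode_escape')
print(step1)

# Step 2: Latin-1 maps code points 0-255 back to single bytes,
# which recovers the original UTF-8 byte sequence.
step2 = step1.encode('latin-1')

# Step 3 (Python 3 only): decode those bytes as UTF-8 to get the text.
step3 = step2.decode('utf-8')
print(step3)
```

The garbled intermediate value is the classic sign of UTF-8 bytes being read as Latin-1.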
Just upgrade to Python 3 and your issues will be sorted, though you will need to change your print statements to print(...). Otherwise, work out the encodings in code as I showed.
Upvotes: 1
Reputation: 21271
That's how it will be shown in the terminal.
But if you want the exported feed to be written in UTF-8, just set this in settings.py:
FEED_EXPORT_ENCODING = 'utf-8'
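To get a feel for what the setting changes in a JSON feed, here is a rough sketch using only the standard json module as a stand-in. It does not call Scrapy, and the item dict is made up for illustration. As far as I know, without the setting the JSON export uses \uXXXX escapes, which is what ensure_ascii=True produces here.

```python
import json

# Hypothetical item, standing in for what a spider might yield.
item = {'test': 'Château'}

# ASCII-safe output: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(item)
print(escaped)

# Readable output: characters are kept as-is, to be written to disk as UTF-8.
readable = json.dumps(item, ensure_ascii=False)
print(readable)

# Both forms decode to the same data; only the serialization differs.
same_data = json.loads(escaped) == json.loads(readable)
```

Either way the data is intact, so this is about how readable the output file is, not about correctness.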
Upvotes: 1