Reputation: 63
I'm using the Python framework Scrapy to scrape data. Here is the code for my spider:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//h1')
    links = hxs.select('//div[@class="pp-title"]')
    #sites = hxs.select('//div[@id="yt-lockup-content"] ')
    items = []
    for site in links:
        item = DmozItem()
        item['title'] = site.select('a/h1/text()').extract()
        item['link'] = site.select('a/@href').extract()
        items.append(item)
    return items
I collect the data into items.json by running the spider with the command scrapy crawl dmoz -o items.json -t json. The data is stored in this format:
[[{"link": ["http://www.ponudadana.hr/Planinarski-dom-Kalnik-2-dana-s-doruckom-za-dvoje-za-149kn-umjesto-300kn-7482_1"], "title": ["Planinarski dom Kalnik - 2 dana s doru\u010dkom za dvoje za 149kn umjesto 300kn!"]},
The problem is that special characters like č, ž, š, đ are stored as \u010d or similar escapes. For example, see the word doru\u010dkom above: it should be doručkom. Can anyone help me? Should I use some encoding format?
Upvotes: 1
Views: 1399
Reputation: 6867
Whether it's JSON or a Python unicode string literal, \u010d means č. Even if it's represented like that in the JSON file, when you decode it, it comes out as the proper letter č.
>>> import json
>>> obj = json.loads("""{"link": ["http://www.ponudadana.hr/Planinarski-dom-Kalnik-2-dana-s-doruckom-za-dvoje-za-149kn-umjesto-300kn-7482_1"], "title": ["Planinarski dom Kalnik - 2 dana s doru\u010dkom za dvoje za 149kn umjesto 300kn!"]}""")
>>> obj['title']
[u'Planinarski dom Kalnik - 2 dana s doru\u010dkom za dvoje za 149kn umjesto 300kn!']
>>> print obj['title'][0]
Planinarski dom Kalnik - 2 dana s doručkom za dvoje za 149kn umjesto 300kn!
The same applies to Python strings:
>>> u"česnakas"
u'\u010desnakas'
>>> print u"česnakas"
česnakas
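If you want the file itself to contain the literal characters rather than \uXXXX escapes, one option is to serialize with ensure_ascii=False and write the result as UTF-8 yourself — a minimal sketch, using Python 3 syntax and a hypothetical item dict mirroring the spider's output:

```python
import json

# Hypothetical item with the same shape the spider produces.
item = {"title": ["Planinarski dom Kalnik - 2 dana s doručkom za dvoje"]}

# ensure_ascii=False keeps non-ASCII characters literal in the JSON text
# instead of escaping them as \uXXXX sequences.
text = json.dumps(item, ensure_ascii=False)

# Write as UTF-8 so the literal characters survive on disk.
with open("items.json", "w", encoding="utf-8") as f:
    f.write(text)
```

Either way the data is the same once decoded; the escapes are only a matter of how the JSON is written. (Newer Scrapy versions also expose a FEED_EXPORT_ENCODING setting you can set to "utf-8" to get unescaped feed output directly.)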
Upvotes: 1