Reputation: 757
Here is my spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from vrisko.items import VriskoItem

class vriskoSpider(CrawlSpider):
    name = 'vrisko'
    allowed_domains = ['vrisko.gr']
    start_urls = ['http://www.vrisko.gr/search/%CE%B3%CE%B9%CE%B1%CF%84%CF%81%CE%BF%CF%82/%CE%BA%CE%BF%CF%81%CE%B4%CE%B5%CE%BB%CE%B9%CE%BF']
    rules = (Rule(SgmlLinkExtractor(allow=('\?page=\d')), 'parse_start_url', follow=True),)

    def parse_start_url(self, response):
        hxs = HtmlXPathSelector(response)
        vriskoit = VriskoItem()
        vriskoit['eponimia'] = hxs.select("//a[@itemprop='name']/text()").extract()
        vriskoit['address'] = hxs.select("//div[@class='results_address_class']/text()").extract()
        return vriskoit
My problem is that the returned strings are unicode, and I want to encode them to UTF-8. I don't know the best way to do this; I have tried several approaches without result.
Thank you in advance!
Upvotes: 43
Views: 54428
Reputation: 1541
Now I can pass this setting as a command-line parameter:
scrapy runspider blah.py -o myjayson.json -s FEED_EXPORT_ENCODING=utf-8
Upvotes: 2
Reputation: 31
You should add FEED_EXPORT_ENCODING = 'utf-8' to the settings file in your Scrapy project.
Upvotes: 3
Reputation: 101
Try adding the following line to the config file for Scrapy (i.e. settings.py):
FEED_EXPORT_ENCODING = 'utf-8'
Upvotes: 10
Reputation: 49
I had a lot of problems due to encoding with Python and Scrapy. To be sure to avoid encoding/decoding problems, the best thing to do is to write:
unicode(response.body.decode(response.encoding)).encode('utf-8')
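A minimal sketch of that decode-then-encode round trip, using a hypothetical Greek page body in ISO-8859-7 (note that decode already returns unicode, so the extra unicode(...) wrapper is redundant):

```python
# Hypothetical response: Greek text in the page's declared encoding
page_encoding = 'iso-8859-7'                  # stand-in for response.encoding
raw_body = u'Γιατρός'.encode(page_encoding)   # stand-in for response.body

text = raw_body.decode(page_encoding)         # bytes -> unicode text
utf8_bytes = text.encode('utf-8')             # unicode text -> UTF-8 bytes
```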
Upvotes: 4
Reputation: 1625
Since Scrapy 1.2.0, a new setting, FEED_EXPORT_ENCODING, was introduced. By setting it to utf-8, JSON output will not be escaped. Add this to your settings.py:
FEED_EXPORT_ENCODING = 'utf-8'
Upvotes: 112
Reputation: 21
As was mentioned earlier, the JSON exporter writes unicode symbols escaped by default, and it has an option to write them as unicode: ensure_ascii=False.
To export items in utf-8 encoding you can add this to your project's settings.py file:

from scrapy.exporters import JsonLinesItemExporter

class MyJsonLinesItemExporter(JsonLinesItemExporter):
    def __init__(self, file, **kwargs):
        super(MyJsonLinesItemExporter, self).__init__(file, ensure_ascii=False, **kwargs)

FEED_EXPORTERS = {
    'jsonlines': 'yourproject.settings.MyJsonLinesItemExporter',
    'jl': 'yourproject.settings.MyJsonLinesItemExporter',
}
Then run:
scrapy crawl spider_name -o output.jl
Upvotes: 0
Reputation: 21
Here is a simple way to do that: it saves the JSON data to '<spider name>.json' in UTF-8. Note that the file must be opened in open_spider (not __init__), since the spider is not available yet when the pipeline is constructed, and the matching close_spider hook closes it:

from scrapy.exporters import JsonItemExporter

class JsonWithEncodingPipeline(object):
    def open_spider(self, spider):
        # open the output file when the spider starts, so spider.name is available
        self.file = open(spider.name + '.json', 'wb')
        self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Upvotes: 2
Reputation: 9522
Scrapy returns strings in unicode, not ascii. To encode all strings to utf-8, you can write:
vriskoit['eponimia'] = [s.encode('utf-8') for s in hxs.select('//a[@itemprop="name"]/text()').extract()]
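A self-contained version of that comprehension, with hypothetical strings standing in for what .extract() returns:

```python
# Hypothetical unicode strings, as .extract() would return them
extracted = [u'Γιατρός Α', u'Γιατρός Β']

# Encode each unicode string to UTF-8 bytes
encoded = [s.encode('utf-8') for s in extracted]
```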
But I think you expect a different result: your code returns one item containing all the search results. To return an item for each result:

hxs = HtmlXPathSelector(response)
for eponimia, address in zip(hxs.select("//a[@itemprop='name']/text()").extract(),
                             hxs.select("//div[@class='results_address_class']/text()").extract()):
    vriskoit = VriskoItem()
    vriskoit['eponimia'] = eponimia.encode('utf-8')
    vriskoit['address'] = address.encode('utf-8')
    yield vriskoit
Update
The JSON exporter writes unicode symbols escaped (e.g. \u03a4) by default, because not all streams can handle unicode. json.dumps has an option, ensure_ascii=False, to write them as unicode (see the docs for json.dumps), but I can't find a way to pass this option to the standard feed exporter.
So if you want the exported items written in utf-8 encoding, e.g. to read them in a text editor, you can write a custom item pipeline.
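The escaping difference that ensure_ascii controls can be seen directly with json.dumps, using a hypothetical item containing Greek text:

```python
import json

# Hypothetical scraped item with Greek text
item = {'eponimia': u'Γιατρός'}

escaped = json.dumps(item)                       # default: non-ASCII escaped as \uXXXX
readable = json.dumps(item, ensure_ascii=False)  # Greek characters kept as-is
```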
pipelines.py (the file-closing hook must be named close_spider for an item pipeline; spider_closed is the name of the signal, not the pipeline method):

import json
import codecs

class JsonWithEncodingPipeline(object):
    def __init__(self):
        self.file = codecs.open('scraped_data_utf8.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()
Don't forget to add this pipeline to settings.py:
ITEM_PIPELINES = ['vrisko.pipelines.JsonWithEncodingPipeline']
You can customize the pipeline to write data in a more human-readable format, e.g. to generate a formatted report. JsonWithEncodingPipeline is just a basic example.
Upvotes: 38