Ozcan

Reputation: 490

Unicode on Scrapy Json output

I'm having a problem with Scrapy's JSON output. The crawler works fine and the CLI output is correct. The XML item exporter also works without a problem: the output is saved with the correct encoding and the text is not escaped.

These won't work as my data includes sub branches.

Unicode text in the JSON output file is escaped like this: "\u00d6\u011fretmen S\u00fcleyman Yurtta\u015f Cad."

But in the XML output file it is written correctly: "Öğretmen Süleyman Yurttaş Cad."
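For reference, the same difference can be reproduced with the standard json module alone, outside Scrapy. This is only a minimal sketch of the escaping behaviour; the "address" key is a made-up field name, not one from my items:

```python
# -*- coding: utf-8 -*-
import json

# Hypothetical item field; the value is the string from the output above.
address = u"Öğretmen Süleyman Yurttaş Cad."

# Default behaviour: every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps({"address": address})

# With ensure_ascii=False the characters are kept as they are.
readable = json.dumps({"address": address}, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data, so the escaped file is still valid JSON; it is just not human-readable.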

I even changed the Scrapy source code to pass ensure_ascii=False to ScrapyJSONEncoder, but it made no difference.

So, is there any way to force ScrapyJSONEncoder not to escape text when writing to the file?

Edit1: By the way, I'm using Python 2.7.6, as Scrapy does not support Python 3.x.

This is a standard Scrapy crawler: a spider file, a settings file and an items file. First the page list is crawled starting from the base URL, then the content is scraped from those pages. Data pulled from each page is assigned to the fields defined in items.py of the Scrapy project, encoded in UTF-8. There's no problem with that, as everything works fine with XML output.

scrapy crawl --nolog --output=output.json -t json spidername

XML output works without a problem with this command:

scrapy crawl --nolog --output=output.xml -t xml spidername

I have tried editing scrapy/contrib/exporter/__init__.py and scrapy/utils/serialize.py to pass the ensure_ascii=False parameter to json.JSONEncoder.

Edit2:

I tried debugging again. There's no problem up to the Python2.7/json/encoder.py code: the data is intact and not escaped. After that it gets hard to debug, as Scrapy works asynchronously and there are lots of callbacks.

Edit3:

It's a bit of a dirty hack, but after editing Python2.7.6/lib/json/encoder.py and changing the ensure_ascii parameter to False, the problem seems to be solved.
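A less invasive variant of the same idea would be an encoder subclass that flips the default, instead of patching the standard library. This is only a sketch using the stdlib json module; UnescapedJSONEncoder is a hypothetical name, and I have not checked how to make Scrapy's exporter pick it up:

```python
# -*- coding: utf-8 -*-
import json


class UnescapedJSONEncoder(json.JSONEncoder):
    """Hypothetical encoder whose ensure_ascii default is False."""

    def __init__(self, **kwargs):
        # Only change the default; an explicit ensure_ascii still wins.
        kwargs.setdefault("ensure_ascii", False)
        super(UnescapedJSONEncoder, self).__init__(**kwargs)


# Use the encoder directly: json.dumps(..., cls=...) passes its own
# ensure_ascii=True explicitly, which would override the default above.
encoder = UnescapedJSONEncoder()
encoded = encoder.encode({"address": u"Öğretmen Süleyman Yurttaş Cad."})
print(encoded)
```

This keeps the change local to the project rather than affecting every program that uses the same Python installation.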

Upvotes: 1

Views: 2405

Answers (2)

Marten Schlüter

Reputation: 96

Add two parameters to your settings.py as described in the documentation:

FEED_FORMAT = 'json'
FEED_EXPORT_ENCODING = 'utf-8'

Upvotes: 1

Dev Pandu

Reputation: 121

As I don't have your code to test, can you try using codecs? Try:

import codecs
f = codecs.open('yourfilename', 'your_mode', 'utf-8')
f.write('whatever you want to write')
f.close()
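A fuller sketch of that idea, combined with ensure_ascii=False so the text is not escaped before it reaches the file. It uses io.open, which behaves the same on Python 2.7 and 3; write_json_utf8, the "address" field and the temporary output path are made-up names for illustration, not part of Scrapy:

```python
# -*- coding: utf-8 -*-
import io
import json
import os
import tempfile


def write_json_utf8(items, path):
    """Serialize items as JSON and write them to path as unescaped UTF-8."""
    text = json.dumps(items, ensure_ascii=False, indent=2)
    # On Python 2, json.dumps can return a byte string for pure-ASCII input.
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)


# Hypothetical scraped items and output location.
items = [{"address": u"Öğretmen Süleyman Yurttaş Cad."}]
path = os.path.join(tempfile.mkdtemp(), "output.json")
write_json_utf8(items, path)
```

Opening the file with an explicit encoding alone is not enough: if the JSON encoder has already escaped the text, the file will contain the \uXXXX sequences no matter how it is opened.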

Upvotes: 1
