Emu

Reputation: 5905

Scrapy: encode JSON output as UTF-8

I've written the following code to scrape data from a site.

import scrapy
from porua_scrapper.items import Category
from porua_scrapper.config import SITE_URL


class CategoriesSpider(scrapy.Spider):
    name = "categories"
    start_urls = []
    for i in range(2):
        url = SITE_URL + "book/categories?page=" + str(i + 1)
        start_urls.append(url)

    print(start_urls)


    def parse(self, response):
        # print(response.css('ul.categoryList li div.pFIrstCatCaroItem a').extract_first())

        for category in response.css('ul.categoryList li'):
            categoryObj = Category()

            categoryObj['name'] = category.css('div.bookSubjectCaption h2::text').extract_first()
            categoryObj['url'] = category.css('a::attr(href)').extract_first()

            yield categoryObj

When I run the command scrapy crawl categories -o categories.json, it creates a categories.json file containing the desired output format. The problem is that some of my content is Bengali text, so the generated output file contains entries like:

{"url": "/book/category/271/\u09a8\u09be\u099f\u0995", "name": "\u09a8\u09be\u099f\u0995"}

How am I supposed to encode the content as UTF-8? As I'm new to Scrapy, I haven't managed to find a suitable solution for my scenario.

Thanks in advance!

Upvotes: 8

Views: 8056

Answers (3)

Thiago Dias

Reputation: 21

To set this from the command line, use the option --set FEED_EXPORT_ENCODING=utf-8:

scrapy runspider --set FEED_EXPORT_ENCODING=utf-8 .\TheScrapyScript.py -o TheOutputFile.json

Upvotes: 2

tae ha

Reputation: 511

In settings.py, add the following line:

FEED_EXPORT_ENCODING = 'utf-8'

Upvotes: 5

paul trmbrth

Reputation: 20748

First of all, {"url": "/book/category/271/\u09a8\u09be\u099f\u0995", "name": "\u09a8\u09be\u099f\u0995"} is valid JSON data:

>>> import json
>>> d = json.loads('''{"url": "/book/category/271/\u09a8\u09be\u099f\u0995", "name": "\u09a8\u09be\u099f\u0995"}''')
>>> print(d['name'])
নাটক

and any program interpreting this data should understand (i.e. decode) the characters just fine. Python's json module calls this option ensure_ascii:

If ensure_ascii is true (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences, and the result is a str instance consisting of ASCII characters only.
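You can see the effect of ensure_ascii directly with the standard-library json module, independently of Scrapy (a quick illustration; the sample dict mirrors the question's data):

```python
import json

data = {"name": "নাটক"}

# Default behaviour: non-ASCII characters are escaped as \uXXXX,
# producing an ASCII-only (but still valid) JSON string.
escaped = json.dumps(data)

# With ensure_ascii=False, the characters are written as-is,
# so the file is human-readable when saved as UTF-8.
readable = json.dumps(data, ensure_ascii=False)

# Both forms decode back to the exact same data.
assert json.loads(escaped) == json.loads(readable) == data
```
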

This is what Scrapy's feed exporter uses by default for JSON output.

But if you need the output JSON file to use another encoding, such as UTF-8, you can use Scrapy's FEED_EXPORT_ENCODING setting.

FEED_EXPORT_ENCODING = 'utf-8'
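If you want UTF-8 output only for this one spider rather than project-wide, Scrapy also lets you override settings per spider via the custom_settings class attribute (a config sketch; the spider name follows the question's code):

```python
import scrapy


class CategoriesSpider(scrapy.Spider):
    name = "categories"

    # Per-spider override: other spiders in the project keep the
    # default (ASCII-escaped) JSON feed export encoding.
    custom_settings = {
        "FEED_EXPORT_ENCODING": "utf-8",
    }
```
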

Upvotes: 22
