Reputation: 3
I am currently facing an encoding issue.
As I am French, I frequently use characters like é or è.
I am trying to figure out why they are not displayed correctly in a JSON file I generated automatically with scrapy.
Here is my Python code:
# -*- coding: utf-8 -*-
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'pokespider'
    start_urls = [
        "https://www.pokepedia.fr/Liste_des_Pok%C3%A9mon_par_apport_en_EV"]

    def parse(self, response):
        for poke in response.css('table.tableaustandard.sortable tr')[1:]:
            num = poke.css('td ::text').extract_first()
            nom = poke.css('td:nth-child(3) a ::text').extract_first()
            yield {'numero': int(num), 'nom': nom}
Then, after I run the scrapy command, the spider produces a JSON file. Here are its first lines:
[
{"numero": 1, "nom": "Bulbizarre"},
{"numero": 2, "nom": "Herbizarre"},
{"numero": 3, "nom": "Florizarre"},
{"numero": 4, "nom": "Salam\u00e8che"},
...
]
(Yes, these are French Pokémon names.)
So I would like to get rid of this \u00e8 escape sequence; it should be an è.
Is there a way to do this?
Thank you in advance, and I hope my English is not too poor :)
Upvotes: 0
Views: 592
Reputation: 2619
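Use the FEED_EXPORT_ENCODING setting, shown here in custom_settings. When it is left unset, Scrapy's JSON output escapes non-ASCII characters as \uXXXX sequences; setting it to 'utf-8' writes the characters themselves.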
import scrapy
from scrapy.crawler import CrawlerProcess

class BlogSpider(scrapy.Spider):
    name = 'pokespider'
    # Write the feed as real UTF-8 instead of \uXXXX escapes.
    custom_settings = {'FEED_EXPORT_ENCODING': 'utf-8'}
    start_urls = [
        "https://www.pokepedia.fr/Liste_des_Pok%C3%A9mon_par_apport_en_EV"]

    def parse(self, response):
        for poke in response.css('table.tableaustandard.sortable tr')[1:]:
            num = poke.css('td ::text').extract_first()
            nom = poke.css('td:nth-child(3) a ::text').extract_first()
            yield {'numero': int(num), 'nom': nom}

# Run the spider from a script and export the items as JSON.
process = CrawlerProcess(settings={
    "FEEDS": {
        "items_json": {"format": "json"},
    },
})
process.crawl(BlogSpider)
process.start()
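To see what the setting changes, here is a minimal sketch that uses only the standard library json module, not Scrapy itself. I am assuming the exporter's escaping is analogous to json.dumps with its default ensure_ascii=True; the item dict is just a sample taken from the question's output.

```python
import json

# Sample item, copied from the question's output (illustration only).
item = {"numero": 4, "nom": "Salamèche"}

# Default behaviour: non-ASCII characters are escaped as \uXXXX.
escaped = json.dumps(item)

# With ensure_ascii=False the character is written as-is.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)   # {"numero": 4, "nom": "Salam\u00e8che"}
print(readable)  # {"numero": 4, "nom": "Salamèche"}

# Both forms decode to the same Python string.
assert json.loads(escaped) == json.loads(readable) == item
```

Note that the escaped file was already valid JSON: any JSON parser reads \u00e8 back as è, so the setting only affects how the file looks in an editor.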
Upvotes: 1