Reputation: 2320
I've wrote a simple script to extract data from some site. Script works as expected but I'm not pleased with output format
Here is my code
class ArticleSpider(Spider):
name = "article"
allowed_domains = ["example.com"]
start_urls = (
"http://example.com/tag/1/page/1"
)
def parse(self, response):
next_selector = response.xpath('//a[@class="next"]/@href')
url = next_selector[1].extract()
# url is like "tag/1/page/2"
yield Request(urlparse.urljoin("http://example.com", url))
item_selector = response.xpath('//h3/a/@href')
for url in item_selector.extract():
yield Request(urlparse.urljoin("http://example.com", url),
callback=self.parse_article)
def parse_article(self, response):
item = ItemLoader(item=Article(), response=response)
# here i extract title of every article
item.add_xpath('title', '//h1[@class="title"]/text()')
return item.load_item()
I'm not pleased with the output, something like:
[scrapy] DEBUG: Scraped from <200 http://example.com/tag/1/article_name> {'title': [u'\xa0"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f"']}
I think I need to use custom ItemLoader class but I don't know how. Need your help.
TL;DR I need to convert text, scraped by Scrapy from unicode to utf-8
Upvotes: 5
Views: 4377
Reputation: 1529
There are 2 independant issues affecting display of unicode string.
if you return a list of strings, the output file will have some issue them because it will use ascii codec by default to serialize list elements. You can work around as below but it's more appropriate to use extract_first()
as suggested by @neverlastn
class Article(Item):
title = Field(serializer=lambda x: u', '.join(x))
the default implementation of repr() method will serialize unicode string to their escaped version \uxxxx
. You can change this behaviour by overriding this method in your item class
class Article(Item):
def __repr__(self):
data = self.copy()
for k in data.keys():
if type(data[k]) is unicode:
data[k] = data[k].encode('utf-8')
return super.__repr__(data)
Upvotes: 0
Reputation: 2204
As you can see below, this isn't much of a Scrapy issue but more of Python itself. It could also marginally be called an issue :)
$ scrapy shell http://censor.net.ua/resonance/267150/voobscheto_svoboda_zakanchivaetsya
In [7]: print response.xpath('//h1/text()').extract_first()
"ВООБЩЕ-ТО СВОБОДА ЗАКАНЧИВАЕТСЯ"
In [8]: response.xpath('//h1/text()').extract_first()
Out[8]: u'\xa0"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f"'
What you see is two different representations of the same thing - a unicode string.
What I would suggest is run crawls with -L INFO
or add LOG_LEVEL='INFO'
to your settings.py
in order to not show this output in the console.
One annoying thing is that when you save as JSON, you get escaped unicode JSON e.g.
$ scrapy crawl example -L INFO -o a.jl
gives you:
$ cat a.jl
{"title": "\u00a0\"\u0412\u041e\u041e\u0411\u0429\u0415-\u0422\u041e \u0421\u0412\u041e\u0411\u041e\u0414\u0410 \u0417\u0410\u041a\u0410\u041d\u0427\u0418\u0412\u0410\u0415\u0422\u0421\u042f\""}
This is correct but it takes more space and most applications handle equally well non-escaped JSON.
Adding a few lines in your settings.py
can change this behaviour:
from scrapy.exporters import JsonLinesItemExporter
class MyJsonLinesItemExporter(JsonLinesItemExporter):
def __init__(self, file, **kwargs):
super(MyJsonLinesItemExporter, self).__init__(file, ensure_ascii=False, **kwargs)
FEED_EXPORTERS = {
'jsonlines': 'myproject.settings.MyJsonLinesItemExporter',
'jl': 'myproject.settings.MyJsonLinesItemExporter',
}
Essentially what we do is just setting ensure_ascii=False
for the default JSON Item Exporters. This prevents escaping. I wish there was an easier way to pass arguments to exporters but I can't see any since they are initialized with their default arguments around here. Anyway, now your JSON file has:
$ cat a.jl
{"title": " \"ВООБЩЕ-ТО СВОБОДА ЗАКАНЧИВАЕТСЯ\""}
which is better-looking, equally valid and more compact.
Upvotes: 7