Alok

Reputation: 10534

Saving scraped items to a JSON/CSV/XML file using Scrapy

I am learning Scrapy (a web crawling framework) from its official documentation.
Following the examples and documentation, I created a spider that scrapes data using a sitemap:

from scrapy.contrib.spiders import SitemapSpider
from scrapy.selector import Selector
from MyProject1.items import MyProject1Item

class MySpider(SitemapSpider):
    name = "myspider"
    sitemap_urls = ['http://www.somesite.com/sitemap.xml']
    sitemap_follow = ['/sitemapbrowsesection\d+']
    count = 0

    def parse(self, response):
        self.count = self.count + 1
        item = MyProject1Item()
        sel = Selector(response)
        item["exp"] = sel.xpath('/html/head/title/text()').extract()
        print str(self.count) + ":\t" + response.url
        print sel.xpath('/html/head/title/text()').extract()
        print "\n\n"
        yield item  # return the item so the feed exporter can write it out

I can see the scraped results on screen, along with the log, by running

scrapy crawl myspider

and I can save the scraped results to a JSON/CSV/XML file by adding an option to the command, e.g.

scrapy crawl myspider -o item.json -t json

to get the results in a JSON file.

My problem is that Scrapy seems to dump the scraped results to item.json only when it is done crawling, which means I have to wait until the crawl is over. For a big project I would have to wait a very long time, since I assume Scrapy writes the scraped results to the JSON file only after all the crawling is done.

I want Scrapy to write to the JSON file promptly, or at least during the crawl, so that I can see the results for the pages that have already been crawled while Scrapy is still running.

I know there must be something built into Scrapy that I am missing. I tried to get help from http://doc.scrapy.org/en/latest/topics/feed-exports.html
and
http://doc.scrapy.org/en/latest/topics/exporters.html
but was not able to solve my problem. So I am looking for some help or example code; otherwise I will have to add a few lines myself to create the JSON/CSV file and write the scraped results into it (as sketched below).
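
For reference, this is roughly the fallback I have in mind, written as an item pipeline rather than inside parse(), since that is where Scrapy expects per-item processing. A minimal sketch, with JsonWriterPipeline and items.jl as placeholder names (and note that in older Scrapy versions ITEM_PIPELINES is a list rather than a dict):

# settings.py
ITEM_PIPELINES = {'MyProject1.pipelines.JsonWriterPipeline': 300}

# pipelines.py
import json

class JsonWriterPipeline(object):
    """Write each item to items.jl as one JSON object per line."""

    def open_spider(self, spider):
        self.fp = open('items.jl', 'w')

    def close_spider(self, spider):
        self.fp.close()

    def process_item(self, item, spider):
        self.fp.write(json.dumps(dict(item)) + "\n")
        self.fp.flush()  # force each line to disk right away
        return item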

Upvotes: 4

Views: 4981

Answers (2)

Alok

Reputation: 10534

scrapy crawl myspider -o item.json -t json does save the fetched results while crawling. I don't have to wait for the crawl to finish; I can see the contents of item.json while the crawler is running,

so there is no need to include code in the spider for writing the fetched data to a file.

Upvotes: 0

R. Max

Reputation: 6700

That's just the way writing to a file works: there is a buffer that gets written to disk only when it's full.

For example, in one shell open a file with python:

$ ipython

In [1]: fp = open('myfile', 'w')

In another shell monitor the file content:

$ tail -f myfile

Go back to the python shell and write some content:

In [2]: _ = [fp.write("This is my file content\n") for i in range(100)]

In my case, I don't see any content in the tail output. Write more content:

In [3]: _ = [fp.write("This is my file content\n") for i in range(100)]

Now I see the lines in the tail output.

In fact, you can change the file buffering (see [1]). Open a file again, this time unbuffered:

$ ipython

In [1]: fp = open('myfile', 'w', buffering=0)

Monitor the file content in another shell:

$ tail -f myfile

Write something and see the tail output:

In [2]: fp.write("Hello there\n")
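
With the unbuffered file the line shows up in the tail output right away. A middle ground is to keep the buffering and flush explicitly whenever you want the pending data on disk; back in the first (buffered) session:

In [4]: fp.flush()  # pushes everything still in the buffer to disk

After the flush, tail shows the remaining buffered lines immediately.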

It is good to have buffering enabled (it reduces disk I/O). Your items file will get the output eventually, but you might want to change the format to the default jsonlines (no -t argument needed); with that you get one JSON object per line. It's a widely used format for streaming.
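
For example, assuming your Scrapy version infers the jsonlines format from the .jl extension (otherwise pass -t jsonlines explicitly):

$ scrapy crawl myspider -o items.jl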

You can read a jsonlines file (.jl extension) easily:

import json

for line in open('items.jl'):
    data = json.loads(line)
    # do stuff with data

And even use other tools like head + json.tool or jq (see [2]):

$ head -1 items.jl | python -m json.tool
$ jq . items.jl

I haven't seen any problems so far with a large job writing its items to a .jl file (or any other format). Nevertheless, if your job gets killed you will lose the last items left in the buffer. This can be solved by storing the items in a database or something similar.
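
For example, a minimal sketch of such a pipeline using the standard sqlite3 module (the SQLiteWriterPipeline name, the items.db path and the single-column schema are my own choices):

import json
import sqlite3

class SQLiteWriterPipeline(object):
    """Commit every item to a local SQLite database so a killed job loses nothing."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect('items.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS items (data TEXT)')

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        self.conn.execute('INSERT INTO items VALUES (?)',
                          (json.dumps(dict(item)),))
        self.conn.commit()  # each commit is durable even if the job dies
        return item

Register it in ITEM_PIPELINES like any other pipeline.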

[1] http://docs.python.org/2/library/functions.html#open

[2] http://stedolan.github.io/jq/manual/

Upvotes: 1
