Reputation: 10534
I am learning Scrapy (a web crawling framework) from its official documentation.
Following the examples and documentation, I created a spider to scrape data using a sitemap:
from scrapy.contrib.spiders import SitemapSpider
from scrapy.selector import Selector
from MyProject1.items import MyProject1Item
class MySpider(SitemapSpider):
    name = "myspider"
    sitemap_urls = ['http://www.somesite.com/sitemap.xml']
    sitemap_follow = ['/sitemapbrowsesection\d+']
    count = 0

    def parse(self, response):
        self.count = self.count + 1
        item = MyProject1Item()
        sel = Selector(response)
        item["exp"] = sel.xpath('/html/head/title/text()').extract()
        print str(self.count) + ":\t" + response.url
        print sel.xpath('/html/head/title/text()').extract()
        print "\n\n"
        yield item
I can see the scraped results on screen, along with the log, by running the command
scrapy crawl myspider
I can save the scraped results to a json/csv/xml file by adding an option to the command, for example
scrapy crawl myspider -o item.json -t json
to get the results in a json file.
My problem is that Scrapy dumps the scraped results to item.json only when it has finished crawling, which means I have to wait until the crawl is over. For a big project that would mean waiting a very long time, since I assume Scrapy writes the scraped results to the json file only after all the crawling is done.
I want Scrapy to write to the json file promptly, or at least during the crawl, so that I can see the results for the sites that have already been crawled while Scrapy is still running.
I know there must be something built into Scrapy that I am missing. I tried to get help from http://doc.scrapy.org/en/latest/topics/feed-exports.html
and
http://doc.scrapy.org/en/latest/topics/exporters.html
but was not able to solve my problem. So I am looking for some help or example code; otherwise I will have to add a few lines to the parse(self, response)
function to create a json/csv file and write the scraped results into it myself.
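The kind of thing I am thinking of adding to parse() looks roughly like this (just an illustrative sketch; the items.jl filename is made up, and it simply appends one json object per response):
import json

from scrapy.contrib.spiders import SitemapSpider
from scrapy.selector import Selector

from MyProject1.items import MyProject1Item


class MySpider(SitemapSpider):
    name = "myspider"
    sitemap_urls = ['http://www.somesite.com/sitemap.xml']

    def parse(self, response):
        sel = Selector(response)
        item = MyProject1Item()
        item["exp"] = sel.xpath('/html/head/title/text()').extract()
        # append each item immediately, one json object per line,
        # so the file can be inspected while the crawl is still running
        with open('items.jl', 'a') as f:
            f.write(json.dumps(dict(item)) + "\n")
        yield item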
Upvotes: 4
Views: 4981
Reputation: 10534
scrapy crawl myspider -o item.json -t json
saves the fetched results while crawling. I don't have to wait for the crawl to finish; I can see the contents of item.json while the crawler is running,
so there is no need to include code in the spider for writing the fetched data to a file.
Upvotes: 0
Reputation: 6700
That's the way writing to a file works. There is a buffer that gets written to the disk only when it's full.
For example, in one shell, open a file with Python:
$ ipython
In [1]: fp = open('myfile', 'w')
In another shell, monitor the file content:
$ tail -f myfile
Go back to the Python shell and write some content:
In [2]: _ = [fp.write("This is my file content\n") for i in range(100)]
In my case, I don't see any content in the tail output. Write more content:
In [3]: _ = [fp.write("This is my file content\n") for i in range(100)]
Now I see the lines in the tail output.
In fact, you can change the file buffering (see [1]). Open a file again:
$ ipython
In [1]: fp = open('myfile', 'w', buffering=0)
Monitor the file content in another shell:
$ tail -f myfile
Write something and see the tail output:
In [2]: fp.write("Hello there\n")
It is good to have buffering enabled (it reduces disk I/O). Your items file will get the output eventually, but you might want to change the format to the default jsonlines (no -t argument needed); with that you get one json object per line. It's a widely used format for streaming.
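For example, something like the following should produce jsonlines output (command adapted from the question; the exact behaviour depends on your Scrapy version):
scrapy crawl myspider -o items.jl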
You can read a jsonlines (.jl extension) file easily:
import json

for line in open('items.jl'):
    data = json.loads(line)
    # do stuff with data
And even use other tools like head + json.tool or jq (see [2]):
$ head -1 items.jl | python -m json.tool
$ jq -I . items.jl
I haven't seen any problem so far with a large job writing the items to a .jl file (or any other format). Nevertheless, if your job gets killed you will lose the last items still sitting in the buffer. This can be solved by storing the items in a db or something similar.
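A minimal sketch of that idea, assuming the item has a single exp field and using sqlite3 (the pipeline class name and the items.db filename are made up; you would also have to enable the pipeline via the ITEM_PIPELINES setting):
import json
import sqlite3


class SqliteWriterPipeline(object):
    """Commit each item as soon as it is scraped, so a killed job loses nothing."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect('items.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS items (exp TEXT)')

    def process_item(self, item, spider):
        # one row per item, committed immediately
        self.conn.execute('INSERT INTO items VALUES (?)',
                          (json.dumps(item.get('exp')),))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
Each commit hits the disk right away, so even an interrupted crawl keeps everything scraped up to that point.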
[1] http://docs.python.org/2/library/functions.html#open
[2] http://stedolan.github.io/jq/manual/
Upvotes: 1