Reputation: 187
I am scraping from a large file containing a list of URLs. Obviously I cannot crawl all of the URLs in a single run. My current solution reads a URL from the file; once the spider has crawled and downloaded the documents from that page, I write a line to a new file that looks something like this:
https://url_i_completed_crawling E:/location_I_stored_crawled_files
https://another_url_i_completed_crawling E:/another_location_I_stored_crawled_files
My issue is that when I stop the spider and try to continue where I left off, the program starts from the original text file of URLs and begins to recrawl and overwrite the previous downloads with the same content.
I tried putting code in the spider that checks whether the URL passed into the parse function is already in the "completed_urls.txt" file... but that check gets slower and slower as the number of completed URLs grows.
So my question is this: how can I remember which URL was the last one crawled, and have my spider start from the next URL in the text file when I restart the program?
# file containing urls to crawl is passed in from command line
# > scrapy crawl fbo-crawler -a filename=FBOSpider/urls_file.txt
def __init__(self, filename=None):
    if filename:
        with open(filename, 'r') as r:
            # here I want to check if r.readlines() is passing a URL that I have already crawled
            # crawled URLs are stored in a text file as shown above
            self.start_urls = r.readlines()
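Roughly, the check I have in mind would look like this (just a sketch, assuming each line of completed_urls.txt holds the URL followed by the save location, separated by whitespace):
def __init__(self, filename=None):
    # Collect the URLs that were already crawled (first column of completed_urls.txt).
    completed = set()
    try:
        with open('completed_urls.txt', 'r') as done:
            for line in done:
                if line.strip():
                    completed.add(line.split()[0])
    except FileNotFoundError:
        pass  # first run: nothing has been crawled yet

    if filename:
        with open(filename, 'r') as r:
            # Keep only the URLs that have not been crawled yet.
            self.start_urls = [
                url.strip() for url in r
                if url.strip() and url.strip() not in completed
            ]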
Upvotes: 0
Views: 996
Reputation: 237
SCRAPY AND DELTAFETCH
DeltaFetch is a Scrapy spider middleware to ignore requests to pages containing items seen in previous crawls of the same spider, thus producing a "delta crawl" containing only new items.
First, install DeltaFetch using pip:
pip install scrapy-deltafetch
Then, you have to enable it in your project's settings.py file:
SPIDER_MIDDLEWARES = {
'scrapy_deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True
Resetting DeltaFetch
If you want to re-scrape pages, you can reset the DeltaFetch cache by passing the deltafetch_reset argument to your spider:
scrapy crawl test -a deltafetch_reset=1
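For context, here is a rough sketch of how this fits a spider like the one in the question (spider name taken from the question; if I read the middleware correctly, a page counts as "seen" once its response has yielded an item, so the spider itself needs no extra bookkeeping):
import scrapy

class FboSpider(scrapy.Spider):
    name = 'fbo-crawler'

    def __init__(self, filename=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if filename:
            with open(filename, 'r') as r:
                self.start_urls = [line.strip() for line in r if line.strip()]

    def parse(self, response):
        # Yielding an item from the response is what lets DeltaFetch
        # skip this URL on the next run.
        yield {'url': response.url}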
You can check out the project page on GitHub for further information.
Upvotes: 0
Reputation: 3561
According to the Scrapy docs:
Scrapy supports pausing and resuming crawls out of the box.
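The relevant mechanism is the JOBDIR setting, which persists the crawl state to a directory; running the same command again resumes the crawl where it stopped (the directory name below is just an example):
scrapy crawl fbo-crawler -s JOBDIR=crawls/fbo-crawler-1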
Upvotes: 1
Reputation: 546
It may be a good idea to store such tabular data in a database table; relational databases are recommended for this purpose because indexing makes lookups fast even as the list grows. Alternatively, in your case, simply removing the scraped URLs from the original file may be enough.
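For illustration, a minimal sketch with SQLite (the file and table names here are made up): the URL column is the primary key, so the "have I crawled this?" check is an indexed lookup rather than a scan of a growing text file:
import sqlite3

conn = sqlite3.connect('crawl_state.db')
conn.execute('CREATE TABLE IF NOT EXISTS completed (url TEXT PRIMARY KEY, location TEXT)')

def already_crawled(url):
    # Primary-key lookup is index-backed, so it stays fast as the table grows.
    return conn.execute('SELECT 1 FROM completed WHERE url = ?', (url,)).fetchone() is not None

def mark_crawled(url, location):
    conn.execute('INSERT OR IGNORE INTO completed VALUES (?, ?)', (url, location))
    conn.commit()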
Upvotes: 0