Reputation: 187
I am scraping from a large file containing a list of URLs. Obviously I cannot crawl all of the URLs in a single run. My current solution reads a URL from the file; once the spider has crawled and downloaded the documents from that page, I write a line to a new file that looks something like this:
https://url_i_completed_crawling E:/location_I_stored_crawled_files
https://another_url_i_completed_crawling E:/another_location_I_stored_crawled_files
My issue is that when I stop the spider and try to continue where I left off, the program starts from the original text file of URLs and begins to recrawl and overwrite the previous downloads with the same content.
I tried putting code in the spider that checks whether the URL passed into the parse function is already in the "completed_urls.txt" file... but that check gets slower and slower as the number of completed URLs grows.
So my question is this: how can I remember which URL was the last one crawled, and have my spider start from the next URL in the text file when I restart the program?
# file containing urls to crawl is passed in from command line
# > scrapy crawl fbo-crawler -a filename=FBOSpider/urls_file.txt
def __init__(self, filename=None):
    if filename:
        with open(filename, 'r') as r:
            # here I want to check if r.readlines() is passing a URL that I have already crawled
            # crawled URLs are stored in a text file as shown above
            self.start_urls = r.readlines()
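Roughly, the check I have in mind would look like this (just a sketch, assuming each line of completed_urls.txt holds the URL followed by the save location, separated by whitespace):
def __init__(self, filename=None):
    # Collect the URLs that were already crawled (first column of completed_urls.txt).
    completed = set()
    try:
        with open('completed_urls.txt', 'r') as done:
            for line in done:
                if line.strip():
                    completed.add(line.split()[0])
    except FileNotFoundError:
        pass  # first run: nothing has been crawled yet

    if filename:
        with open(filename, 'r') as r:
            # Keep only the URLs that have not been crawled yet.
            self.start_urls = [
                url.strip() for url in r
                if url.strip() and url.strip() not in completed
            ]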
Upvotes: 0
Views: 996
Reputation: 237
SCRAPY AND DELTAFETCH
DeltaFetch is a Scrapy spider middleware to ignore requests to pages containing items seen in previous crawls of the same spider, thus producing a "delta crawl" containing only new items.
First, install DeltaFetch using pip:
pip install scrapy-deltafetch
Then, you have to enable it in your project's settings.py file:
SPIDER_MIDDLEWARES = {
'scrapy_deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True
Resetting DeltaFetch
If you want to re-scrape pages, you can reset the DeltaFetch cache by passing the deltafetch_reset argument to your spider:
scrapy crawl test -a deltafetch_reset=1
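For context, here is a rough sketch of how this fits a spider like the one in the question (spider name taken from the question; if I read the middleware correctly, a page counts as "seen" once its response has yielded an item, so the spider itself needs no extra bookkeeping):
import scrapy

class FboSpider(scrapy.Spider):
    name = 'fbo-crawler'

    def __init__(self, filename=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if filename:
            with open(filename, 'r') as r:
                self.start_urls = [line.strip() for line in r if line.strip()]

    def parse(self, response):
        # Yielding an item from the response is what lets DeltaFetch
        # skip this URL on the next run.
        yield {'url': response.url}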
You can check out the project page on GitHub for further information.
Upvotes: 0
Reputation: 3561
According to the Scrapy docs:
Scrapy supports pausing and resuming crawls out of the box.
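The relevant mechanism is the JOBDIR setting, which persists the crawl state to a directory; running the same command again resumes the crawl where it stopped (the directory name below is just an example):
scrapy crawl fbo-crawler -s JOBDIR=crawls/fbo-crawler-1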
Upvotes: 1
Reputation: 546
It may be a good idea to store such tabular data in a database table; relational databases are recommended for this purpose because indexing makes lookups fast even as the list grows. Alternatively, in your case, simply removing the scraped URLs from the original file may be enough.
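For illustration, a minimal sketch with SQLite (the file and table names here are made up): the URL column is the primary key, so the "have I crawled this?" check is an indexed lookup rather than a scan of a growing text file:
import sqlite3

conn = sqlite3.connect('crawl_state.db')
conn.execute('CREATE TABLE IF NOT EXISTS completed (url TEXT PRIMARY KEY, location TEXT)')

def already_crawled(url):
    # Primary-key lookup is index-backed, so it stays fast as the table grows.
    return conn.execute('SELECT 1 FROM completed WHERE url = ?', (url,)).fetchone() is not None

def mark_crawled(url, location):
    conn.execute('INSERT OR IGNORE INTO completed VALUES (?, ?)', (url, location))
    conn.commit()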
Upvotes: 0