Jimmy Sanchez

Reputation: 741

Running Scrapy in a For Loop hangs after first run

I want to run Scrapy in a for loop, one crawl per URL in a list. (NB: I don't want all of these URLs as start_urls; I need them to run one at a time.)

My first try gave me twisted.internet.error.ReactorNotRestartable errors after the first iteration of the loop.

A search on SO turned up a previous answer that says process.start(stop_after_crawl=False) should solve this problem. That got rid of the Twisted error, but now the script just hangs after the first iteration of the loop. This is not a duplicate of that question.

My current code is:

for url in urls:
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'DEPTH_LIMIT': 4
    })

    process.crawl(MySpider, url)
    process.start(stop_after_crawl=False)

The first URL runs fine, then it just hangs:

 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 8, 12, 21, 12, 29, 963422)}
2018-08-12 22:12:30 [scrapy.core.engine] INFO: Spider closed (finished)

Upvotes: 0

Views: 1259

Answers (2)

notorious.no

Reputation: 5107

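process.start(stop_after_crawl=False) blocks for as long as the reactor runs, so the loop never reaches its second iteration. Instead, you should be able to start the reactor once and chain the crawls inside it with CrawlerRunner and a couple of Twisted modules. Here's a quick example: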

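from scrapy.crawler import CrawlerRunner
from twisted.internet import defer, task

# task.react starts the reactor, calls the function with it, then stops the
# reactor and exits the process once the returned Deferred fires.
@task.react
@defer.inlineCallbacks
def crawl_my_sites(reactor):
    runner = CrawlerRunner({})  # settings such as USER_AGENT or DEPTH_LIMIT go here
    for url in urls:  # urls and MySpider are the ones from the question
        # Wait for this crawl to finish before starting the next one.
        yield runner.crawl(MySpider, url)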

Upvotes: 1

Thomas Strub

Reputation: 1285

To loop through a list of URLs with Scrapy, I think overriding start_requests in the spider is a good idea:

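import csv
import scrapy

# Method of your Spider subclass:
def start_requests(self):
    with open('./input/id_urls_10k.csv', 'r') as csvfile:
        urlreader = csv.reader(csvfile, delimiter=',', quotechar='"')
        for row in urlreader:
            # Only request rows flagged "y"; carry the row key along in meta.
            if row[1] == "y":
                yield scrapy.Request(url=row[2], meta={'urlkey': row[0]})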
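
The row filtering in the start_requests snippet above can be pulled into a small helper and checked without running a crawl. This is only a sketch: the helper name iter_flagged_urls and the sample rows are made up for illustration, and the column layout (key, "y" flag, URL) is inferred from the indexes row[0], row[1] and row[2] used in the answer.

```python
import csv
import io


def iter_flagged_urls(lines):
    """Yield (urlkey, url) for every CSV row whose second column is "y".

    Assumed layout, inferred from the answer's indexes:
    column 0 = key, column 1 = flag, column 2 = URL.
    """
    reader = csv.reader(lines, delimiter=",", quotechar='"')
    for row in reader:
        # Skip blank or short rows instead of raising IndexError.
        if len(row) < 3:
            continue
        if row[1] == "y":
            yield row[0], row[2]


# Hypothetical sample data; the real file's contents are not shown in the answer.
sample = io.StringIO(
    'a1,y,https://example.com/one\n'
    'a2,n,https://example.com/two\n'
    '\n'
    'a3,y,"https://example.com/three?x=1,2"\n'
)
pairs = list(iter_flagged_urls(sample))
print(pairs)
```

Inside start_requests you would then loop over iter_flagged_urls(csvfile) and yield scrapy.Request(url=url, meta={'urlkey': key}) for each pair. On Python 3, the csv module documentation recommends opening the file with newline='' so that quoted fields containing line breaks are parsed correctly.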

Upvotes: 1
