Reputation: 154
I've created a script in Scrapy to parse the titles of different sites listed in start_urls. The script is doing its job flawlessly.
What I wish to do now is let my script stop after two of the URLs are parsed, no matter how many URLs are listed there.
Here is what I've tried so far:
import scrapy
from scrapy.crawler import CrawlerProcess

class TitleSpider(scrapy.Spider):
    name = "title_bot"
    start_urls = ["https://www.google.com/","https://www.yahoo.com/","https://www.bing.com/"]

    def parse(self, response):
        yield {'title':response.css('title::text').get()}

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(TitleSpider)
    c.start()
How can I make my script stop once two of the listed URLs are scraped?
Upvotes: 5
Views: 913
Reputation: 3561
Currently I see only one way to immediately stop this script: using the os._exit force-exit function, which terminates the process at once without any of Python's normal cleanup:
import os
import scrapy
from scrapy.crawler import CrawlerProcess

class TitleSpider(scrapy.Spider):
    name = "title_bot"
    start_urls = ["https://www.google.com/","https://www.yahoo.com/","https://www.bing.com/"]
    item_counter = 0

    def parse(self, response):
        yield {'title': response.css('title::text').get()}
        self.item_counter += 1
        print(self.item_counter)
        if self.item_counter >= 2:
            # flush/persist the stats before killing the process
            self.crawler.stats.close_spider(self, "2 items")
            os._exit(0)

if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    c.crawl(TitleSpider)
    c.start()
Other things that I tried, which did not give the required result (immediately stopping the script after 2 scraped items, with only 3 URLs in start_urls):

1. Passing the CrawlerProcess instance into the spider settings and calling CrawlerProcess.stop, reactor.stop, etc., and other methods from the parse method.

2. Using the CloseSpider extension (docs, source) with the following CrawlerProcess definition:
c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'EXTENSIONS': {
        'scrapy.extensions.closespider.CloseSpider': 500,
    },
    'CLOSESPIDER_ITEMCOUNT': 2,
})
3. Reducing the CONCURRENT_REQUESTS setting to 1 (with a raise CloseSpider condition in the parse method); see the sketch after this list.
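For reference, a minimal sketch of that third attempt (assuming the same TitleSpider as above, modified to raise CloseSpider in parse after 2 items):

from scrapy.crawler import CrawlerProcess

# CONCURRENT_REQUESTS=1 serializes the requests, in the hope that no
# extra request is already in flight when CloseSpider is raised
c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'CONCURRENT_REQUESTS': 1,
})
c.crawl(TitleSpider)
c.start()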
When the application has scraped 2 items and reaches the line with raise CloseSpider, the 3rd request has already been started in another thread.
If the spider is stopped in the conventional way, the application stays active until it has processed the previously sent requests and their responses, and only after that does it close.
As your application has a relatively low number of URLs in start_urls, it starts processing all of them long before it reaches raise CloseSpider.
Upvotes: 2
Reputation: 23
enumerate does the job fine, with some changes in architecture: instead of calling parse on each URL directly, only request the first two URLs in the first place. A runnable form of this idea (assuming it lives inside the TitleSpider from the question) is a start_requests override:

def start_requests(self):
    # enumerate numbers the URLs, so we can stop after the first two
    for cnt, url in enumerate(self.start_urls):
        if cnt > 1:
            break
        yield scrapy.Request(url, callback=self.parse)
Upvotes: -1
Reputation: 3118
As Gallaecio proposed, you can add a counter, but the difference here is that you export an item after the if statement. This way, it will almost always end up exporting 2 items.
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.exceptions import CloseSpider

class TitleSpider(scrapy.Spider):
    name = "title_bot"
    start_urls = ["https://www.google.com/", "https://www.yahoo.com/", "https://www.bing.com/"]
    item_limit = 2

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.counter = 0

    def parse(self, response):
        self.counter += 1
        if self.counter > self.item_limit:
            raise CloseSpider
        yield {'title': response.css('title::text').get()}
Why almost always, you may ask? It has to do with a race condition in the parse method.
Imagine that self.counter is currently equal to 1, which means that one more item is expected to be exported. But now Scrapy receives two responses at the same moment and invokes the parse method for both of them. If the two invocations of parse both increase the counter before either one checks it, they will both see self.counter equal to 3 and thus will both raise the CloseSpider exception.
In this case (which is very unlikely, but can still happen), the spider will export only one item.
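If you want to rule out even that unlikely interleaving, one option (a hypothetical variant, not from the original answer) is to draw a unique sequence number per response from itertools.count, whose next() is effectively atomic in CPython, so two calls can never observe the same value:

import itertools
import scrapy
from scrapy.exceptions import CloseSpider

class TitleSpider(scrapy.Spider):
    name = "title_bot"
    start_urls = ["https://www.google.com/", "https://www.yahoo.com/", "https://www.bing.com/"]
    item_limit = 2

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.sequence = itertools.count(1)  # hands out 1, 2, 3, ... one at a time

    def parse(self, response):
        # each response draws its own unique number, so exactly
        # item_limit responses can pass this check, however the
        # calls interleave
        if next(self.sequence) > self.item_limit:
            raise CloseSpider
        yield {'title': response.css('title::text').get()}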
Upvotes: 1
Reputation: 3847
Building on top of https://stackoverflow.com/a/38331733/939364, you can define a counter in the constructor of your spider, and use parse to increase it and raise CloseSpider when it reaches 2:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.exceptions import CloseSpider  # 1. Import CloseSpider

class TitleSpider(scrapy.Spider):
    name = "title_bot"
    start_urls = ["https://www.google.com/","https://www.yahoo.com/","https://www.bing.com/"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.counter = 0  # 2. Define a self.counter property

    def parse(self, response):
        yield {'title':response.css('title::text').get()}
        self.counter += 1  # 3. Increase the count on each parsed URL
        if self.counter >= 2:
            raise CloseSpider  # 4. Raise CloseSpider after 2 URLs are parsed

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(TitleSpider)
    c.start()
I am not 100% certain that it will prevent a third URL from being parsed, because I think CloseSpider stops new requests from starting but waits for already started requests to finish.
If you want to prevent more than 2 items from being scraped, you can edit parse not to yield items when self.counter > 2; see the sketch below.
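A minimal sketch of that guard, as I read the suggestion (not code from the answer above): move the check before the yield, so a late third response can never add an item even if its request was already in flight:

def parse(self, response):
    self.counter += 1
    if self.counter > 2:
        # a 3rd response may still arrive for an already-started
        # request; refuse to yield its item before closing
        raise CloseSpider
    yield {'title': response.css('title::text').get()}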
Upvotes: 0