ningyuwhut

Reputation: 629

How to crawl a limited number of pages from a site using Scrapy?

I need to crawl a number of sites, and I only want to crawl a certain number of pages from each site. How can I implement this?

My idea is to use a dict where the key is the domain name and the value is the number of pages from that domain that have already been stored in MongoDB. When a page is crawled and stored in the database successfully, the count for its domain is incremented by one. If the count exceeds the maximum, the spider should stop crawling that site.

Below is my code, but it doesn't work: even when spider.crawledPagesPerSite[domain_name] is greater than spider.maximumPagesPerSite, the spider keeps crawling.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
# AnExampleItem and parse_page come from my own project modules

class AnExampleSpider(CrawlSpider):
    name = "anexample"
    rules = (
        Rule(LinkExtractor(allow=r"/*.html"), callback="parse_url", follow=True),
    )

    def __init__(self, url_file):  # , N=10, *a, **kw
        data = open(url_file, 'r').readlines()  # [:N]
        self.allowed_domains = [i.strip() for i in data]
        self.start_urls = ['http://' + domain for domain in self.allowed_domains]
        super(AnExampleSpider, self).__init__()  # *a, **kw

        self.maximumPagesPerSite = 100  # maximum pages per site
        self.crawledPagesPerSite = {}

    def parse_url(self, response):
        url = response.url
        item = AnExampleItem()
        html_text = response.body
        extracted_text = parse_page.parse_page(html_text)
        item["url"] = url
        item["extracted_text"] = extracted_text
        return item

import pymongo
import tldextract
from scrapy.exceptions import DropItem
from scrapy import log
from scrapy.conf import settings  # old-style settings access, matching the log.msg call below

class MongoDBPipeline(object):
    def __init__(self):
        self.connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])

    def process_item(self, item, spider):
        domain_name = tldextract.extract(item['url']).domain
        db = self.connection[domain_name]  # use the domain name as the database name
        self.collection = db[settings['MONGODB_COLLECTION']]
        valid = True
        for data in item:
            if not data:
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            self.collection.insert(dict(item))
            log.msg("Item added to MongoDB database!", level=log.DEBUG, spider=spider)
            # count this page against its domain
            if domain_name in spider.crawledPagesPerSite:
                spider.crawledPagesPerSite[domain_name] += 1
            else:
                spider.crawledPagesPerSite[domain_name] = 1
            # once the limit is reached, try to stop crawling this domain
            if spider.crawledPagesPerSite[domain_name] > spider.maximumPagesPerSite:
                suffix = tldextract.extract(item['url']).suffix
                domain_and_suffix = domain_name + "." + suffix

                if domain_and_suffix in spider.allowed_domains:
                    spider.allowed_domains.remove(domain_and_suffix)
                    spider.rules[0].link_extractor.allow_domains.remove(domain_and_suffix)
                    return None
            return item

Upvotes: 3

Views: 3704

Answers (4)

Mirza Bilal

Reputation: 1050

If you are using SitemapSpider, you can use sitemap_filter, which is a proper way to filter entries.

I have posted the complete solution here in response to another similar question
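For illustration, here is a minimal sketch of that idea (not the linked solution): a SitemapSpider whose sitemap_filter stops yielding sitemap entries after a fixed limit. The sitemap URL and the limit of 100 are assumptions.

from scrapy.spiders import SitemapSpider


class LimitedSitemapSpider(SitemapSpider):
    name = "limited_sitemap"
    sitemap_urls = ["https://example.com/sitemap.xml"]  # assumed sitemap location
    max_pages = 100  # assumed per-spider limit

    def sitemap_filter(self, entries):
        # entries is an iterable of dicts (each has at least a 'loc' key);
        # only the first max_pages entries are passed on to be crawled.
        for i, entry in enumerate(entries):
            if i >= self.max_pages:
                break
            yield entry

    def parse(self, response):
        # Placeholder callback; real extraction logic goes here.
        yield {"url": response.url}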

Upvotes: 0

Geoffroy de Viaris

Reputation: 381

I am a beginner with Scrapy myself, but I combined two answers from other Stack Overflow posts to find a solution that works for me. Let's say you want to stop scraping after N pages; then you can import the CloseSpider exception like this:

# To import it:
from scrapy.exceptions import CloseSpider

# Later, to use it:
raise CloseSpider('message')

You can, for example, integrate it into the parser to close the spider after N URLs:

# Inside your spider class:
N = 10      # change 10 to however many pages you want
count = 0   # the count starts at zero

def parse(self, response):
    # Stop once N pages have been scraped
    if self.count >= self.N:
        raise CloseSpider(f"Scraped {self.N} items. Eject!")
    # Increment the count by one
    self.count += 1

    # Put the rest of the parsing code here
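Note that raising CloseSpider shuts the spider down gracefully, so requests that are already in flight may still complete; the final number of parsed pages can therefore end up slightly above N.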

Links to the original posts I found:

  1. Force spider to stop in scrapy
  2. Scrapy: How to limit number of urls scraped in SitemapSpider

Upvotes: 1

Hamza Rana

Reputation: 137

I am not sure if this is what you're looking for, but I use this approach to scrape only a certain number of pages. Let's say I want to scrape only the first 99 pages of example.com; I would go about it the following way:

start_urls = ["https://example.com/page-%s.htm" % page for page in range(1, 100)]  # pages 1 through 99

The spider simply runs out of URLs after page 99. But this only works when the URLs have page numbers in them.
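For completeness, a minimal sketch of how that pattern sits inside a spider; the domain and page range are placeholders:

import scrapy


class NumberedPagesSpider(scrapy.Spider):
    name = "numbered_pages"
    # Enumerate the numbered pages up front; the spider stops when the list is exhausted.
    start_urls = ["https://example.com/page-%s.htm" % page for page in range(1, 100)]

    def parse(self, response):
        # Each response corresponds to one numbered page; real extraction goes here.
        yield {"url": response.url}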

Upvotes: 3

Ali Nikneshan

Reputation: 3502

What about this:

def parse_url(self, response):
    url = response.url
    domain_name = tldextract.extract(url).domain
    if domain_name in self.crawledPagesPerSite:
        # If enough pages have been visited in this domain, stop here
        if self.crawledPagesPerSite[domain_name] > self.maximumPagesPerSite:
            return
        self.crawledPagesPerSite[domain_name] += 1
    else:
        self.crawledPagesPerSite[domain_name] = 1
    print(self.crawledPagesPerSite[domain_name])
    # ...build and return the item here as before...

Upvotes: -1
