crocefisso

Reputation: 903

Is it possible to run pipelines and crawl multiple URLs at the same time in scrapy?

My spider looks like this:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http import Request
from ProjectName.items import ProjectName

class SpidernameSpider(CrawlSpider):
    name = 'spidername'
    allowed_domains = ['webaddress']
    start_urls = ['webaddress/query1']

    rules = (
            Rule(LinkExtractor(restrict_css='horizontal css')),
            Rule(LinkExtractor(restrict_css='vertical css'),
                     callback='parse_item')
            )

    def parse_item(self, response):
        item = ProjectName()
        css_1 = 'css1::text'
        item['1'] = response.css(css_1).extract()

        css_2 = 'css2::text'
        item['2'] = response.css(css_2).extract()
        return item

and my pipeline like this:

from scrapy.exceptions import DropItem

class RemoveIncompletePipeline(object):
    def process_item(self, item, spider):
        if item['1']:
            return item
        else:
            raise DropItem("Missing content in %s" % item)

Everything works fine: when the value for field 1 is missing, the corresponding item is dropped from the output.

But when I change start_urls to handle multiple queries, like this:

f = open("queries.txt")
start_urls = [url.strip() for url in f.readlines()]
f.close()

or like this:

start_urls = [i.strip() for i in open('queries.txt').readlines()]

Then the output contains the items with missing value for field 1.

What's going on? And how can I avoid that?

For the record, queries.txt looks like this:

webaddress/query1
webaddress/query2

Upvotes: 0

Views: 691

Answers (1)

Danil

Reputation: 5181

According to the docs, you should override the start_requests method.

This method must return an iterable with the first Requests to crawl for this spider.

This is the method called by Scrapy when the spider is opened for scraping when no particular URLs are specified. If particular URLs are specified, the make_requests_from_url() is used instead to create the Requests. This method is also called only once from Scrapy, so it’s safe to implement it as a generator.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http import Request
from ProjectName.items import ProjectName

class SpidernameSpider(CrawlSpider):
    name = 'spidername'
    allowed_domains = ['webaddress']
    start_urls = ['webaddress/query1']

    rules = (
            Rule(LinkExtractor(restrict_css='horizontal css')),
            Rule(LinkExtractor(restrict_css='vertical css'),
                     callback='parse_item')
            )

    def start_requests(self):
        return [Request(i.strip(), callback=self.parse_item) for i in open('queries.txt').readlines()]

    def parse_item(self, response):
        item = ProjectName()
        css_1 = 'css1::text'
        item['1'] = response.css(css_1).extract()

        css_2 = 'css2::text'
        item['2'] = response.css(css_2).extract()
        return item

UPD: Just put this code into your spider class:

def start_requests(self):
    return [Request(i.strip(), callback=self.parse_item) for i in open('queries.txt').readlines()]
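A slightly safer variant (just a sketch of the same idea) closes the file with a context manager and yields the requests lazily:

def start_requests(self):
    # read one URL per line; the with-block closes queries.txt automatically
    with open('queries.txt') as f:
        for line in f:
            url = line.strip()
            if url:  # skip blank lines
                yield Request(url, callback=self.parse_item)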

UPD: You have wrong logic in your parse_item method; you need to fix it.

def parse_item(self, response):
    for job in response.css('div.card-top'):
        item = ProjectName()
        # just a quick example
        item['city'] = job.xpath('string(.//span[@class="serp-location"])').extract()[0].replace(' ', '').replace('\n', '')
        # TODO: you should fill the other item fields
        # ...
        yield item

Upvotes: 2
