Hiroki

Reputation: 4173

Scrapy: scrape multiple pages and yield the results in a single array

What I'm trying to do is to scrape multiple pages and yield the result in a single array.

I found this post, which describes how to scrape multiple pages and yield a text from each of the scraped pages.

I referred to this approach (and modified it a bit), and here's what my spider looks like...

import scrapy
from scrapy import Request

from test_project.items import PriceSpiderItem

class RoomsSpider(scrapy.Spider):
    name = 'rooms'
    allowed_domains = ['sample.com']
    start_urls = ['http://sample.com/rooms']

    def parse(self, response):
        # Yield one item per matching element on the current page
        for resource in response.xpath('.//*[@class="sample"]'):
            item = PriceSpiderItem()
            item['result'] = resource.xpath("text()").extract_first()
            yield item

        # Follow the pagination link, if there is one
        nextUrl = response.xpath('//*[@label="Next"]/@href').extract_first()

        if nextUrl is not None:
            absoluteNextUrl = response.urljoin(nextUrl)
            yield Request(url=absoluteNextUrl, callback=self.parse)

However, with this approach, the result will look like...

{
 "items" : [
  {"result": "blah blah"},
  {"result": "blah blah blah blah blah"},
  {"result": "blah blah blah blah"},
  ...
  etc.
  ...
  {"result": "blah blah blah blah blah"},
  {"result": "blah blah blah"}
 ]
}

This is not exactly what I'm aiming to yield. Ideally, the result would be a single array, like...

{
 "items" : [
  "blah blah",
  "blah blah blah blah blah",
  "blah blah blah blah",
  ...
  "blah blah blah blah blah",
  "blah blah blah"
 ]
}

However, I'm not sure whether it's achievable.

As far as I understand, Scrapy is non-blocking, so I might be able to store the results in a global variable and yield them after the spider has crawled all the pages.

(That said, I'd rather not use a global variable, because it could become difficult to maintain the app as it grows.)

Any advice will be appreciated.

P.S.

@Wim Hermans gave me some interesting approaches (thank you!).

Among them, it should be possible to store the results in a file with an ItemPipeline and return them after all the pages are crawled.

This seems very promising, but if the Spider is running on scrapyrt (or something similar) to work as a REST API endpoint, I'm not sure how to deal with concurrency issues.

# 1. Client A makes a request
# 2. Spider receives Client A's request
# 3. Client B makes a request
# 4. Spider receives Client B's request
# 5. Spider fulfills Client B's request, saves the result in "result.csv"
# 6. Spider fulfills Client A's request, updates "result.csv" with Client A's result
# 7. Spider responds with "result.csv" for both Client A and Client B

Scrapy is non-blocking, so I suppose a scenario like this could happen (one possible workaround, a unique file per crawl, is sketched below).
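As a rough sketch of that idea (just an assumption on my side, not code from my project): an item pipeline could write each crawl's results to its own uniquely named file, so concurrent crawls never touch the same "result.csv". The class name and file-naming scheme below are made up for illustration.

import json
import uuid

class PerCrawlResultsPipeline(object):
    # Hypothetical sketch: collect the results of one crawl and write them
    # to a file whose name is unique to that crawl, so concurrent crawls
    # (e.g. under scrapyrt) don't overwrite each other's output.

    def open_spider(self, spider):
        self.results = []
        # Unique filename per crawl -- an assumption, not an existing convention
        self.filename = 'result-%s.json' % uuid.uuid4().hex

    def process_item(self, item, spider):
        self.results.append(item['result'])
        return item

    def close_spider(self, spider):
        with open(self.filename, 'w') as f:
            json.dump({'items': self.results}, f)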

P.P.S.

If you have to yield the result, the first solution presented by @Wim Hermans is probably the best one (but be careful about memory usage).

Upvotes: 2

Views: 1361

Answers (1)

Wim Hermans

Reputation: 2116

There are a couple of different options I can think of to achieve this:

  1. You pass the result in meta until the scraping finishes:
def parse(self, response):
    # Carry the accumulated results along in meta
    result = response.meta.get('result', [])
    for resource in response.xpath('.//*[@class="sample"]'):
        result.append(resource.xpath("text()").extract_first())

    nextUrl = response.xpath('//*[@label="Next"]/@href').extract_first()
    meta = {'result': result}
    if nextUrl:
        # Not the last page yet: pass the results on to the next request
        absoluteNextUrl = response.urljoin(nextUrl)
        yield Request(url=absoluteNextUrl, callback=self.parse, meta=meta)
    else:
        # Last page: yield a single item containing the full result list
        item = PriceSpiderItem()
        item['result'] = result
        yield item

Depending on how much data you'll be getting, this can become quite heavy.

  2. Write a custom item pipeline:

You don't pass the full result set around in meta; instead, you write an item pipeline that keeps the results in a list and returns the full set at the end.

class CombineResultsPipeline(object):
    def __init__(self):
        self.results = []

    def process_item(self, item, spider):
        # Collect each item's result instead of exporting it right away
        self.results.append(item['result'])
        return item

    def close_spider(self, spider):
        # At this point all pages have been scraped
        print(f"full result set is {self.results}")

This is basically like storing the results in a global variable, so it might not be exactly what you need either.
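If you do go this route, note that the pipeline only runs once it's enabled in the project settings; assuming it lives in test_project/pipelines.py, that would look something like:

# settings.py -- assuming the pipeline is defined in test_project/pipelines.py
ITEM_PIPELINES = {
    'test_project.pipelines.CombineResultsPipeline': 300,
}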

  3. Write to a file/database

A more memory-efficient option could be to write the results to a file (or to a database) and do some processing on them afterwards to get the format you need. You can do this in an item pipeline (e.g. exporting items to JSON) or just use Scrapy's Feed Exports.
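With Feed Exports you don't need any pipeline code at all: Scrapy writes every scraped item to the file you configure in settings.py. The filename below is only a placeholder, and newer Scrapy versions express the same thing through the FEEDS setting instead of these two keys.

# settings.py -- the filename is only a placeholder
FEED_FORMAT = 'json'
FEED_URI = 'results.json'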

Upvotes: 5
