Scrapy: parse data from multiple pages (pagination) and combine the yielded output into a single array

Question

I'm trying to scrape multiple pages and yield the results as a single array.

What I've tried so far:

import scrapy


class RealtorSpider(scrapy.Spider):
    name = "realtor"
    allowed_domains = ["realtor.com"]
    start_urls = ["http://realtor.com/"]

    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Sec-GPC": "1",
        "Connection": "keep-alive",
        "If-None-Match": '"d9b9d-uhdwucnqmaT5gbxbobPzbm+uEgs"',
        "Cache-Control": "max-age=0",
        "TE": "trailers",
    }

    def start_requests(self):
        url = "https://www.realtor.com/realestateandhomes-search/Seattle_WA/show-newest-listings"

        for page in range(1, 4):
            next_page = url + "/pg-" + str(page)
            yield scrapy.Request(
                url=next_page, headers=self.headers, callback=self.parse, priority=1
            )

    def parse(self, response):
        # extract data
        for card in response.css("ul.property-list"):
            item = {"price": card.css("span[data-label=pc-price]::text").getall()}
            yield item

This gives me three separate lists of prices:

['$740,000', '$998,000', '$620,000', ......, '$719,000', '$2,975,000', '$1,099,000']
['$500,000', '$474,000', '$725,000', ......, '$895,000', '$619,500', '$1,199,000']
['$1,095,000', '$475,000', '$700,000', ........, '$950,000', '$995,000', '$639,950']

What I am looking for is one single list like this:

$740,000 - 1
$998,000 - 2
$620,000 - 3
$719,000 - 4
     .
     .
     .
$995,000 - 143
$639,950 - 144

TheFaultInOurStars · Accepted Answer

I am not sure exactly what produced the example lists, but let's say you called one of the methods in RealtorSpider and got three lists back. Since these methods use yield to return values, you probably need to call list() on their output to get a list instead of a generator.
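To illustrate the generator point on its own, here is a minimal sketch that does not use Scrapy at all. The function parse_pages and the sample prices are hypothetical stand-ins for a spider callback and its scraped data; the only real API involved is itertools.chain.from_iterable, which flattens the per-page lists into one list.

```python
from itertools import chain


def parse_pages(pages):
    """Hypothetical stand-in for a spider callback: yields one item per page."""
    for prices in pages:
        yield {"price": prices}


# Made-up sample data for illustration, not real scraped output.
pages = [["$740,000", "$998,000"], ["$500,000"], ["$1,095,000", "$475,000"]]

# Calling list() consumes the generator and gives one item per page.
items = list(parse_pages(pages))

# Flatten the per-page price lists into a single list.
all_prices = list(chain.from_iterable(item["price"] for item in items))
print(all_prices)
```

The same flattening can be written as a nested list comprehension; chain.from_iterable is just one way to express it.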

I suggest editing your realtor.py file as follows:

import scrapy
import json

class RealtorSpider(scrapy.Spider):
    name = "realtor"
    allowed_domains = ["realtor.com"]
    start_urls = ["http://realtor.com/"]
    prices = []
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Sec-GPC": "1",
        "Connection": "keep-alive",
        "If-None-Match": '"d9b9d-uhdwucnqmaT5gbxbobPzbm+uEgs"',
        "Cache-Control": "max-age=0",
        "TE": "trailers",
    }

    def start_requests(self):
        url = "https://www.realtor.com/realestateandhomes-search/Seattle_WA/show-newest-listings"

        for page in range(1, 4):
            next_page = url + "/pg-" + str(page)
            yield scrapy.Request(
                url=next_page, headers=self.headers, callback=self.parse, priority=1
            )

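    def parse(self, response):
        # Collect the prices found in each listing container on this page.
        for card in response.css("ul.property-list"):
            item = {"price": card.css("span[data-label=pc-price]::text").getall()}
            self.prices.append(item["price"])
            yield item

    def closed(self, reason):
        # Scrapy calls closed() once when the spider finishes, so the file is
        # written a single time instead of being rewritten after every page.
        data = [price for page in self.prices for price in page]
        with open("data.json", "w") as f:
            json.dump(data, f)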

With these changes, running scrapy crawl realtor in a shell generates a file named data.json, which contains exactly what you want. You can then simply read it:

import json

# Use a context manager so the file handle is closed after reading.
with open("data.json") as f:
    data = json.load(f)
data

Output

['$575,000',
 '$399,950',
 '$620,000',
 '$1,150,000',
 '$1,100,000',
 '$880,000',
 '$735,000',
 '$337,000',
 '$759,800',
 '$330,000',
 '$575,000',
 '$740,000',
 '$639,950',
 '$950,000',
 '$575,000',
 '$895,000',
 '$950,000',
 '$675,000',
 '$629,000',
 '$2,000,000',
 '$1,325,000',
 '$714,900',
 '$699,950',
 '$998,000',
 '$1,150,000',
 '$849,999',
 '$999,000',
 '$1,050,000',
 '$750,000',
 '$2,975,000',
 '$1,300,000',
 '$1,350,000',
 '$400,000',
 '$1,349,000',
 '$1,175,000',
 '$1,049,000',
 '$3,500,000',
 '$849,000',
 '$719,000',
 '$734,950',
 '$1,099,000',
 '$769,000',
 '$489,000',
 '$1,095,000',
 '$700,000',
 '$475,000',
 '$450,000',
 '$625,000',
 '$330,000',
 '$425,000',
 '$685,000',
 '$385,000',
 '$649,950',
 '$815,000',
 '$699,000',
 '$525,000',
 '$1,495,000',
 '$325,000',
 '$835,000',
 '$599,950',
 '$1,150,000',
 '$895,000',
 '$998,900',
 '$775,000',
 '$565,000',
 '$750,000',
 '$879,000',
 '$325,000',
 '$1,000,000',
 '$785,000',
 '$725,000',
 '$899,000',
 '$1,095,000',
 '$1,175,000',
 '$815,000',
 '$2,300,000',
 '$950,000',
 '$929,000',
 '$1,249,900',
 '$1,650,000',
 '$1,500,000',
 '$639,950',
 '$995,000',
 '$750,000',
 '$630,000',
 '$999,000',
 '$474,000',
 '$390,000',
 '$485,000',
 '$725,000',
 '$500,000',
 '$340,000',
 '$689,000',
 '$525,000',
 '$650,000',
 '$589,950',
 '$665,000',
 '$725,000',
 '$460,000',
 '$749,450',
 '$1,088,000',
 '$525,000',
 '$495,000',
 '$830,000',
 '$475,000',
 '$999,000',
 '$849,950',
 '$848,000',
 '$480,000',
 '$538,000',
 '$4,585,000',
 '$1,150,000',
 '$1,045,000',
 '$730,000',
 '$630,000',
 '$1,950,000',
 '$899,000',
 '$1,975,000',
 '$1,179,500',
 '$2,100,000',
 '$829,000',
 '$2,750,000',
 '$895,000',
 '$849,950',
 '$619,500',
 '$1,199,000']
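The question asked for each price followed by its position in the combined list. A small sketch of that last step, assuming the flat list has already been loaded; number_prices is a hypothetical helper name, and the three sample prices are taken from the question's example.

```python
def number_prices(prices):
    """Return each price followed by its 1-based position in the list."""
    return [f"{price} - {i}" for i, price in enumerate(prices, start=1)]


# Short sample taken from the question; in practice pass the loaded data list.
sample = ["$740,000", "$998,000", "$620,000"]
for line in number_prices(sample):
    print(line)
# $740,000 - 1
# $998,000 - 2
# $620,000 - 3
```

Note that Scrapy schedules requests asynchronously, so the order of prices across pages is not guaranteed to match the page order.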
