Hat hout

Reputation: 491

How to scrape data from multiple pages into the same CSV row?

I need to scrape data from multiple pages. First the spider should scrape data from the first page, then extract a URL to a second page from it and scrape some data from that page, too.

All of it should end up in the same CSV row.

This is the first page: https://www.catalogs.ssg.asia/toyota/?fromchanged=true&lang=en&l=bWFya2V0PT1nZW5lcmFsfHxzdD09MjB8fHN0cz09eyIxMCI6IlJlZ2lvbiIsIjIwIjoiTWlkZGxlIEVhc3QifQ%3D%3D

An example of the data is the first row of the table, e.g. catalog, model, production, and series.

This is the second page: https://www.catalogs.ssg.asia/toyota/?fromchanged=true&lang=en&l=bWFya2V0PT1nZW5lcmFsfHxzdD09MzB8fHN0cz09eyIxMCI6IlJlZ2lvbiIsIjIwIjoiTWlkZGxlIEVhc3QiLCIzMCI6IjRSVU5ORVIgNjcxMzYwIn18fGNhdGFsb2c9PTY3MTM2MHx8cmVjPT1CMw%3D%3D

An example of the data: series, engine, production date.

Both should be together on the same CSV row, as in the screenshot.

This is my code:

import scrapy

from scrapy.http import Request

from properties.items import PropertiesItem


class BasicSpider(scrapy.Spider):
    name = "manual"

    # This is the page from which I navigate to the Middle East region.
    start_urls = ["https://www.catalogs.ssg.asia/toyota/?fromchanged=true&lang=en"]

    def parse(self, response):
        # First page: take the token from the onclick handler of the second
        # table row and build the URL of the region page.
        token = response.xpath('//*[@id="rows"]/tr[2]/@onclick').re(r"HM\.set\('([^']+)'")[0]
        next_url = "https://www.catalogs.ssg.asia/toyota/?fromchanged=true&lang=en&l=" + token
        yield Request(next_url, callback=self.parse_item)

    def parse_item(self, response):
        for tr in response.xpath("/html/body/table[2]/tr/td/table/tr")[1:]:
            item = PropertiesItem()

            # Fields from the first page.
            item['Series'] = tr.xpath("td[1]/text()").extract()
            item['Engine'] = tr.xpath("td[2]/text()").extract()
            # URL of the second page for this row, built from the row's
            # onclick handler.
            second_selector = "https://www.catalogs.ssg.asia/toyota/?fromchanged=true&lang=en&l=" + tr.xpath('@onclick').re(r"HM\.set\('([^']+)'")[0]

            yield item

    def parse_item_2(self, response):
        item = PropertiesItem()
        item['Building_Condition'] = response.xpath('/html/body/table[2]/tr/td/table/tr[2]/td[1]/text()').extract()
        yield item

I need to write some code in parse_item that goes to parse_item_2, handles the second page, and gets the results onto the same CSV row. How can I do that?

Upvotes: 0

Views: 358

Answers (1)

Fran

Reputation: 81

If you want to build a single item from data spread across different URLs, you should pass the item from one Request to the next using the meta attribute. Finally, you yield the completed item from the last callback, so it is written as a single row.

def parse_item(self, response):
    for tr in response.xpath("/html/body/table[2]/tr/td/table/tr")[1:]:
        [...]
        second_selector = [...]
        meta = {'item': item}
        yield Request(second_selector, meta=meta, callback=self.parse_item_2)

def parse_item_2(self, response):
    item = PropertiesItem(response.meta['item'])
    item['Building_Condition'] = response.xpath('/html/body/table[2]/tr/td/table/tr[2]/td[1]/text()').extract()
    yield item
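
For completeness, here is a minimal self-contained sketch of the whole chain. The spider and class names and the use of a plain dict instead of PropertiesItem are only illustrative choices; the XPaths and field names come from the question, so adjust them to the real pages.

import scrapy
from scrapy.http import Request

BASE = "https://www.catalogs.ssg.asia/toyota/?fromchanged=true&lang=en&l="

class ChainedSpider(scrapy.Spider):
    name = "chained"
    start_urls = ["https://www.catalogs.ssg.asia/toyota/?fromchanged=true&lang=en"]

    def parse(self, response):
        # Token for the region page, taken from the row's onclick handler.
        token = response.xpath('//*[@id="rows"]/tr[2]/@onclick').re(r"HM\.set\('([^']+)'")[0]
        yield Request(BASE + token, callback=self.parse_item)

    def parse_item(self, response):
        for tr in response.xpath("/html/body/table[2]/tr/td/table/tr")[1:]:
            # Fields from the first page.
            item = {
                'Series': tr.xpath("td[1]/text()").extract_first(),
                'Engine': tr.xpath("td[2]/text()").extract_first(),
            }
            token = tr.xpath('@onclick').re(r"HM\.set\('([^']+)'")
            if token:
                # Carry the partially filled item to the second page's callback.
                yield Request(BASE + token[0], meta={'item': item},
                              callback=self.parse_item_2)
            else:
                # No second page for this row: yield what we have.
                yield item

    def parse_item_2(self, response):
        item = response.meta['item']
        # Field from the second page; one yield per row, so each CSV row
        # contains data from both pages.
        item['Building_Condition'] = response.xpath(
            '/html/body/table[2]/tr/td/table/tr[2]/td[1]/text()').extract_first()
        yield item

Running it with scrapy crawl chained -o output.csv writes each yielded item as one CSV row. In Scrapy 1.7 and later you can also pass the item via cb_kwargs instead of meta, which makes it an explicit argument of the callback.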

Upvotes: 1
