Reputation: 491
I need to scrape data from multiple pages. First it should scrape data from the first page, then extract from that page a URL to a second page and scrape some data from it, too.
Everything should end up on the same CSV row.
This is the first page: https://www.catalogs.ssg.asia/toyota/?fromchanged=true&lang=en&l=bWFya2V0PT1nZW5lcmFsfHxzdD09MjB8fHN0cz09eyIxMCI6IlJlZ2lvbiIsIjIwIjoiTWlkZGxlIEVhc3QifQ%3D%3D
An example of the data is the first row of the table, e.g. catalog, model, production, and series.
This is the second page: https://www.catalogs.ssg.asia/toyota/?fromchanged=true&lang=en&l=bWFya2V0PT1nZW5lcmFsfHxzdD09MzB8fHN0cz09eyIxMCI6IlJlZ2lvbiIsIjIwIjoiTWlkZGxlIEVhc3QiLCIzMCI6IjRSVU5ORVIgNjcxMzYwIn18fGNhdGFsb2c9PTY3MTM2MHx8cmVjPT1CMw%3D%3D Example of the data: series, engine, production date.
Both should be together on the same CSV row, like in the screenshot:
This is my code:
import datetime
import urlparse
import socket
import scrapy

from scrapy.loader.processors import MapCompose, Join
from scrapy.loader import ItemLoader
from scrapy.http import Request

from properties.items import PropertiesItem


class BasicSpider(scrapy.Spider):
    name = "manual"

    # This is the page from which I will pick the Middle East region.
    start_urls = ["https://www.catalogs.ssg.asia/toyota/?fromchanged=true&lang=en"]

    def parse(self, response):
        # First page
        next_selector = ("https://www.catalogs.ssg.asia/toyota/?fromchanged=true&lang=en&l="
                         + response.xpath('//*[@id="rows"]/tr[2]/@onclick').re(r"HM\.set\('([^']+)'")[0])
        yield Request(next_selector, callback=self.parse_item)

    def parse_item(self, response):
        for tr in response.xpath("/html/body/table[2]/tr/td/table/tr")[1:]:
            item = PropertiesItem()
            item['Series'] = tr.xpath("td[1]/text()").extract()
            item['Engine'] = tr.xpath("td[2]/text()").extract()
            # URL for the second page is built here, but no Request is issued yet
            second_selector = ("https://www.catalogs.ssg.asia/toyota/?fromchanged=true&lang=en&l="
                               + response.xpath('/html/body/table[2]/tr/td/table/tr/@onclick').re(r"HM\.set\('([^']+)'"))
            yield item

    def parse_item_2(self, response):
        item = PropertiesItem()
        item['Building_Condition'] = response.xpath('/html/body/table[2]/tr/td/table/tr[2]/td[1]/text()').extract()
        yield item
I need to write some code in parse_item so that it goes to parse_item_2, handles the second page, and gets the results onto the same CSV row. How do I do that?
Upvotes: 0
Views: 358
Reputation: 81
If you want to build a single item from data spread across different URLs, you should pass it from one Request object to the next using the request's meta attribute. You then yield the finished item only in the last callback, so it gets written out as a single row.
def parse_item(self, response):
    for tr in response.xpath("/html/body/table[2]/tr/td/table/tr")[1:]:
        [...]
        second_selector = [...]
        meta = {'item': item}
        yield Request(second_selector, meta=meta, callback=self.parse_item_2)

def parse_item_2(self, response):
    item = PropertiesItem(response.meta['item'])
    item['Building_Condition'] = response.xpath('/html/body/table[2]/tr/td/table/tr[2]/td[1]/text()').extract()
    yield item
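For completeness, here is a minimal, self-contained sketch of that pattern applied to the spider from the question. The item class, field names, XPaths and the HM.set token extraction are copied from the question as-is and have not been verified against the live site, so treat it as an illustration of the meta-passing technique rather than a drop-in solution:

# Minimal sketch; XPaths, field names and the HM.set token pattern are
# assumptions carried over from the question, not verified against the site.
import scrapy
from scrapy.http import Request

BASE = "https://www.catalogs.ssg.asia/toyota/?fromchanged=true&lang=en&l="


class PropertiesItem(scrapy.Item):
    Series = scrapy.Field()
    Engine = scrapy.Field()
    Building_Condition = scrapy.Field()


class ChainSpider(scrapy.Spider):
    name = "chain"
    start_urls = ["https://www.catalogs.ssg.asia/toyota/?fromchanged=true&lang=en"]

    def parse(self, response):
        # First page: follow the Middle East row to the model table.
        token = response.xpath('//*[@id="rows"]/tr[2]/@onclick').re_first(r"HM\.set\('([^']+)'")
        if token:
            yield Request(BASE + token, callback=self.parse_item)

    def parse_item(self, response):
        for tr in response.xpath("/html/body/table[2]/tr/td/table/tr")[1:]:
            item = PropertiesItem()
            item['Series'] = tr.xpath("td[1]/text()").extract_first()
            item['Engine'] = tr.xpath("td[2]/text()").extract_first()
            # Use this row's own onclick attribute so each item follows its own link.
            token = tr.xpath("@onclick").re_first(r"HM\.set\('([^']+)'")
            if token:
                # Do not yield the item here; hand it to the next callback via meta.
                yield Request(BASE + token,
                              meta={'item': item},
                              callback=self.parse_item_2)

    def parse_item_2(self, response):
        # Pick up the half-filled item, add the second page's data,
        # and yield it only now so the feed exporter writes one row per item.
        item = response.meta['item']
        item['Building_Condition'] = response.xpath(
            '/html/body/table[2]/tr/td/table/tr[2]/td[1]/text()').extract_first()
        yield item

Running it with the CSV feed exporter, e.g. scrapy runspider chain_spider.py -o output.csv (the file name is just an example), writes each item completed in parse_item_2 as one CSV row.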
Upvotes: 1