gustavosobral

Reputation: 13

Recursive crawling over a page

My problem is: I have a list (HTML li) on the main page, and for each entry in the list I want to follow a link to another page, take some information there, put it all together in one item, and then move on to the next entry of the main page list. I wrote the code below, but I'm new to Python and Scrapy and have run into some difficulties.

The version below runs, but it generates two items for each element of the main list.

class BoxSpider(scrapy.Spider):
    name = "mag"
    start_urls = [
        "http://www.example.com/index.html"
    ]

    def secondPage(self, response):
        secondPageItem = CinemasItem()
        secondPageItem['trailer'] = 'trailer'
        secondPageItem['synopsis'] = 'synopsis'
        yield secondPageItem

    def parse(self, response):

        for sel in response.xpath('//*[@id="conteudoInternas"]/ul/li'):

            item = CinemasItem()
            item['title'] = 'title'
            item['room'] = 'room'
            item['mclass'] = 'mclass'
            item['minAge'] = 'minAge'
            item['cover'] = 'cover'
            item['sessions'] = 'sessions'

            secondUrl = sel.xpath('p[1]/a/@href').extract()[0]

            yield item
            yield scrapy.Request(url=secondUrl, callback=self.secondPage)

Can someone help me generate just one item per list entry, with all of the 'title', 'room', 'mclass', 'minAge', 'cover', 'sessions', 'trailer', and 'synopsis' fields filled, instead of one item with 'title', 'room', 'mclass', 'minAge', 'cover', 'sessions' filled and a second with only 'trailer' and 'synopsis'?

Upvotes: 1

Views: 86

Answers (1)

alecxe

Reputation: 474191

You need to pass the item instantiated in parse() inside the meta to the secondPage callback:

def parse(self, response):
    for sel in response.xpath('//*[@id="conteudoInternas"]/ul/li'):
        item = CinemasItem()
        item['title'] = 'title'
        item['room'] = 'room'
        item['mclass'] = 'mclass'
        item['minAge'] = 'minAge'
        item['cover'] = 'cover'
        item['sessions'] = 'sessions'

        secondUrl = sel.xpath('p[1]/a/@href').extract()[0]

        # see: we are passing the item inside the meta
        yield scrapy.Request(url=secondUrl, meta={'item': item}, callback=self.secondPage)

def secondPage(self, response):
    # see: we are getting the item from meta
    item = response.meta['item']

    item['trailer'] = 'trailer'
    item['synopsis'] = 'synopsis'
    yield item
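The flow can be illustrated with plain Python (hypothetical data, no Scrapy involved): the first callback attaches the partial item to the follow-up request's meta instead of yielding it, and only the second callback emits the completed item, so each list entry produces exactly one item.

```python
# A minimal sketch of the meta-passing pattern outside Scrapy
# (hypothetical data; in the real spider, response.meta is filled
# in by the framework when it schedules the Request).

def parse(listing):
    # first callback: build a partial item per list entry and attach
    # it to the follow-up "request" instead of yielding it right away
    for entry in listing:
        item = {'title': entry['title']}
        yield entry['detail_url'], {'item': item}  # url + meta

def second_page(detail, meta):
    # second callback: retrieve the partial item from meta, finish it
    item = meta['item']
    item['synopsis'] = detail['synopsis']
    return item

listing = [{'title': 'Movie A', 'detail_url': '/a'},
           {'title': 'Movie B', 'detail_url': '/b'}]
details = {'/a': {'synopsis': 'plot of A'},
           '/b': {'synopsis': 'plot of B'}}

items = [second_page(details[url], meta) for url, meta in parse(listing)]
print(items)
# one item per list entry, with all fields filled
```

On newer Scrapy versions, the cb_kwargs argument of Request serves the same purpose and delivers the item to the callback as a regular keyword argument, which some people find cleaner than meta.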

Upvotes: 1
