Toleo

Reputation: 774

Scraping multiple Pages in a Loop gives Duplicated results in 3rd Level

I have the following HTML templates:


Level 1 (Template_1)

<ul>
    <li>
        <a href="template_2.html">Template 2 (Level 2)</a>
    </li>
</ul>


Level 2 (Template_2)

<ul>
    <li>
        <a href="template_3.html">Template 3 (Level 3)</a>
    </li>
    <li>
        <a href="template_4.html">Template 4 (Level 3)</a>
    </li>
</ul>


Level 3 (Template_3 & Template_4)

<h1>Template 3 Text</h1>

<h1>Template 4 Text</h1>


What I'm trying to do is open the Level 1 page, follow its link to Level 2, collect the text of each a element there, and then follow each of those links to pull the h1 text from the Level 3 page, using the following spider:

# -*- coding: utf-8 -*-
import scrapy


class LESpider(scrapy.Spider):
    name = 'Loop Error'
    start_urls = ['template_1.html']

    def parse(self, response):
        data = {
            'temp_text': None,
            'text': None
        }

        yield scrapy.Request(url=response.css('a::attr(href)').extract_first(), callback=self.parse_lv2, dont_filter=True, meta={"data": data})

    def parse_lv2(self, response):
        for a in response.css('a'):
            data = response.meta.get('data')

            data['temp_text'] = a.css('a::text').extract_first()

            yield scrapy.Request(url=a.css('a::attr(href)').extract_first(), callback=self.parse_lv3, dont_filter=True, meta={"data": data})


    def parse_lv3(self, response):
        data = response.meta.get('data')

        data['text'] = response.css('h1::text').extract_first()

        yield data

This is the result I expect:

[
  {"temp_text": "Template 3 (Level 3)", "text": "Template 3 Text"},
  {"temp_text": "Template 4 (Level 3)", "text": "Template 4 Text"}
]

But what I get is the following result:

[
  {"temp_text": "Template 4 (Level 3)", "text": "Template 3 Text"},
  {"temp_text": "Template 4 (Level 3)", "text": "Template 4 Text"}
]

temp_text is duplicated: every item carries the text of the last a element in Level 2.

I thought the problem was where I placed yield data, so I moved it into parse_lv2, right after the request, like this:

yield scrapy.Request(url=a.css('a::attr(href)').extract_first(), callback=self.parse_lv3, dont_filter=True, meta={"data": data})

yield data

But then I didn't get the data from parse_lv3.

To find out which part is the problem, I removed parse_lv3 and the yield scrapy.Request(url=a.css('a::attr(href)').extract_first(), callback=self.parse_lv3, dont_filter=True, meta={"data": data}) line from parse_lv2, replacing it with yield data.

That fixed the duplication (but without the parse_lv3 data).

So I'm sure the problem is either in the loop or in the parse_lv3 yield, but I can't figure out how to solve it.

Upvotes: 0

Views: 45

Answers (1)

mxmn

Reputation: 69

The problem is most likely that data is created only once, in parse, so every iteration of the loop in parse_lv2 mutates the same dict, and every parse_lv3 call receives that same object through meta. By the time the parse_lv3 callbacks run, the loop has already overwritten data['temp_text'] with the text of the last link, which is why both items end up with the last value.
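The aliasing can be reproduced without Scrapy. Below is a minimal sketch in plain Python, assuming the callbacks run only after the loop has finished scheduling both requests (which is what the output in the question suggests). The functions schedule_requests and run_callbacks are hypothetical stand-ins for parse_lv2 and parse_lv3, not Scrapy APIs.

```python
# Plain-Python stand-in for the spider; no Scrapy needed.
# schedule_requests and run_callbacks are hypothetical names.

def schedule_requests(links):
    """Mimics parse_lv2 from the question: one dict reused for every link."""
    data = {'temp_text': None, 'text': None}   # created once, as in parse
    pending = []
    for link_text, heading in links:
        data['temp_text'] = link_text          # overwrites the same object
        pending.append((data, heading))        # stores a reference, not a copy
    return pending


def run_callbacks(pending):
    """Mimics parse_lv3 running later, once the loop above has finished."""
    items = []
    for data, heading in pending:
        data['text'] = heading
        items.append(dict(data))               # snapshot taken when the item is yielded
    return items


links = [
    ('Template 3 (Level 3)', 'Template 3 Text'),
    ('Template 4 (Level 3)', 'Template 4 Text'),
]
items = run_callbacks(schedule_requests(links))
print(items)
```

Both entries in items carry the temp_text of the last link, matching the result in the question, because pending holds two references to one dict.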

It is better to create data inside the loop of parse_lv2, like this:

def parse(self, response):
    yield scrapy.Request(url=response.css('a::attr(href)').extract_first(), callback=self.parse_lv2, dont_filter=True)


def parse_lv2(self, response):
    for a in response.css('a'):
        # Fresh dict per link, so each request carries its own item
        data = {}
        data['temp_text'] = a.css('a::text').extract_first()
        yield scrapy.Request(url=a.css('a::attr(href)').extract_first(), callback=self.parse_lv3, dont_filter=True, meta={"data": data})
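If parse does need to pass fields down from Level 1, another option is to copy the dict for each link instead of creating an empty one. This is a minimal plain-Python sketch of that idea; fan_out and the 'source' field are hypothetical and only illustrate the copy, they are not part of the spider above.

```python
import copy


def fan_out(base, link_texts):
    """Yield one independent copy of base per link (stand-in for the parse_lv2 loop)."""
    for link_text in link_texts:
        data = copy.deepcopy(base)     # each request would get its own dict
        data['temp_text'] = link_text
        yield data


# 'source' is a made-up field standing in for anything collected at Level 1.
base = {'source': 'template_1.html', 'temp_text': None, 'text': None}
copies = list(fan_out(base, ['Template 3 (Level 3)', 'Template 4 (Level 3)']))
print(copies)
```

In the spider, the equivalent change would be to pass the copy in meta, so that parse_lv3 writes text into a dict no other request shares.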

Upvotes: 1
