Reputation: 774
I have the following HTML templates:
Level 1 (Template_1)
<ul>
    <li>
        <a href="template_2.html">Template 2 (Level 2)</a>
    </li>
</ul>
Level 2 (Template_2)
<ul>
    <li>
        <a href="template_3.html">Template 3 (Level 3)</a>
    </li>
    <li>
        <a href="template_4.html">Template 4 (Level 3)</a>
    </li>
</ul>
Level 3 (Template_3 & Template_4)
<h1>Template 3 Text</h1>
<h1>Template 4 Text</h1>
What I'm trying to do is open the Level 1 HTML page, pull the text of each a element, then follow each link to pull the text of its h1 element, using the following spider:
# -*- coding: utf-8 -*-
import scrapy


class LESpider(scrapy.Spider):
    name = 'Loop Error'
    start_urls = ['template_1.html']

    def parse(self, response):
        data = {
            'temp_text': None,
            'text': None
        }
        yield scrapy.Request(url=response.css('a::attr(href)').extract_first(), callback=self.parse_lv2, dont_filter=True, meta={"data": data})

    def parse_lv2(self, response):
        for a in response.css('a'):
            data = response.meta.get('data')
            data['temp_text'] = a.css('a::text').extract_first()
            yield scrapy.Request(url=a.css('a::attr(href)').extract_first(), callback=self.parse_lv3, dont_filter=True, meta={"data": data})

    def parse_lv3(self, response):
        data = response.meta.get('data')
        data['text'] = response.css('h1::text').extract_first()
        yield data
My problem is as follows. First, the result I expect is this:
[
{"temp_text": "Template 3 (Level 3)", "text": 'Template 3 Text'},
{"temp_text": "Template 4 (Level 3)", "text": 'Template 4 Text'}
]
But what I get is the following result:
[
{"temp_text": "Template 4 (Level 3)", "text": "Template 3 Text"},
{"temp_text": "Template 4 (Level 3)", "text": "Template 4 Text"}
]
The temp_text value is duplicated: every item gets the value from the last a element in Level 2.
I thought the problem was where I placed my yield data, so I put it under parse_lv2's yield, like this:
yield scrapy.Request(url=a.css('a::attr(href)').extract_first(), callback=self.parse_lv3, dont_filter=True, meta={"data": data})
yield data
But then I didn't get the data from parse_lv3.
To check which part was the problem, I removed parse_lv3 and the yield scrapy.Request(url=a.css('a::attr(href)').extract_first(), callback=self.parse_lv3, dont_filter=True, meta={"data": data}) line from parse_lv2, replacing it with yield data, and the problem was solved (but without the parse_lv3 data).
So I'm sure the problem is either in the loop or in the parse_lv3 yield, but I can't figure out how to solve it.
Upvotes: 0
Views: 45
Reputation: 69
The problem is probably that you only define data once, in parse, so every iteration of the loop in parse_lv2 shares the same dict, and every call to parse_lv3 receives that same dict too. The parse_lv3 callbacks only run after the loop has moved on, which is why you end up with the value from the last iteration of the loop in parse_lv2 in data['temp_text'].
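The sharing can be reproduced without Scrapy. The sketch below is a plain-Python simulation, not Scrapy code: the hypothetical schedule_requests function stands in for parse_lv2, run_callbacks stands in for parse_lv3 being called later, and the assumption is that all callbacks run only after the loop has finished.

```python
def schedule_requests(links, data):
    """Stand-in for parse_lv2: queue one 'request' per link, all carrying the same dict."""
    queue = []
    for link_text, h1_text in links:
        data['temp_text'] = link_text     # overwrites the value in the single shared dict
        queue.append((h1_text, data))     # every queue entry references that same dict
    return queue


def run_callbacks(queue):
    """Stand-in for parse_lv3, called after the loop above has completed."""
    items = []
    for h1_text, data in queue:
        data['text'] = h1_text
        items.append(dict(data))          # snapshot of the dict at the moment it is yielded
    return items


links = [('Template 3 (Level 3)', 'Template 3 Text'),
         ('Template 4 (Level 3)', 'Template 4 Text')]
shared = {'temp_text': None, 'text': None}
items = run_callbacks(schedule_requests(links, shared))
print(items)
```

Both items end up with the temp_text of the last link, while text differs, which matches the result described in the question.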
You'd be better off initializing data inside the loop in parse_lv2, like this:
def parse(self, response):
    yield scrapy.Request(url=response.css('a::attr(href)').extract_first(), callback=self.parse_lv2, dont_filter=True)

def parse_lv2(self, response):
    for a in response.css('a'):
        data = dict()
        data['temp_text'] = a.css('a::text').extract_first()
        yield scrapy.Request(url=a.css('a::attr(href)').extract_first(), callback=self.parse_lv3, dont_filter=True, meta={"data": data})
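If parse does need to pass some fields down to the deeper levels, one alternative is to keep the dict from parse and copy it once per iteration. The sketch below is a plain-Python stand-in for the spider rather than Scrapy code; the names schedule_requests and run_callbacks and the source field are hypothetical, and the callbacks are assumed to run only after the loop has finished.

```python
def schedule_requests(links, base):
    """Stand-in for parse_lv2: each 'request' gets its own copy of the base dict."""
    queue = []
    for link_text, h1_text in links:
        data = base.copy()                # shallow copy, one independent dict per link
        data['temp_text'] = link_text
        queue.append((h1_text, data))
    return queue


def run_callbacks(queue):
    """Stand-in for parse_lv3, called after the loop above has completed."""
    items = []
    for h1_text, data in queue:
        data['text'] = h1_text
        items.append(data)
    return items


links = [('Template 3 (Level 3)', 'Template 3 Text'),
         ('Template 4 (Level 3)', 'Template 4 Text')]
base = {'source': 'template_1.html', 'temp_text': None, 'text': None}
items = run_callbacks(schedule_requests(links, base))
print(items)
```

Note that dict.copy() is shallow: if the dict held nested lists or dicts, those would still be shared between requests, and copy.deepcopy would be the safer choice.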
Upvotes: 1