Reputation: 372
I'm trying to get a Scrapy spider to crawl a website's pages in the order my spider code makes the Request() calls.
It's similar to this question: Scrapy Crawl URLs in Order
I examined the answers to that question and tried them, but none of them work quite the way I need.
My problem is that I need to scrape a table on a page. Each table row holds several values, and one of them is an href to another page. The first callback method scrapes the table, then makes a subsequent Request() call using that href. The first page spawns requests to many other pages. I pass the data from the first callback method to the second callback method in a dict, using the Request's meta keyword.
The second callback method scrapes the contents of that page and adds the parsed data to the dict it was passed. But the data from the first callback isn't always for the same game as the data from the second callback.
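In outline, the pattern I'm describing looks like this (a minimal sketch with illustrative names and a placeholder URL, not my actual code):
import scrapy

class PatternSketch(scrapy.Spider):
    name = "pattern_sketch"
    start_urls = ["https://somewebsite.com/schedule.htm"]  # placeholder

    def parse(self, response):
        # First callback: scrape each table row, then follow its href
        for row in response.xpath("//table/tbody/tr"):
            row_data = {"game_id": row.xpath(".//td[@data = 'game_id']/text()").get()}
            href = row.xpath(".//a/@href").get()
            if href:
                yield scrapy.Request(response.urljoin(href),
                                     meta={"dict": row_data},
                                     callback=self.parse_detail)

    def parse_detail(self, response):
        # Second callback: read the dict back and add this page's data
        row_data = response.meta["dict"]
        row_data["game_time"] = response.xpath("//td[@data = 'game_time']/text()").get()
        yield row_data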
The HTML document of the first page looks like this:
# Game Schedule page
<html>
<body>
<div>
<table type="games">
<tbody>
<tr row="1">
<th data="week_number">1</th>
<td data="date">"9/13/2020"</td>
<td data="game_id">1</td>
<td data="game_summary"><a href="/game/20200913_01.html">game stats</a></td>
</tr>
<tr row="2">
<th data="week_number">1</th>
<td data="date">"9/13/2020"</td>
<td data="game_id">2</td>
<td data="game_summary"><a href="/game/20200913_02.html">game stats</a></td>
</tr>
<tr row="3">
<th data="week_number">1</th>
<td data="date">"9/13/2020"</td>
<td data="game_id">3</td>
<td data="game_summary"><a href="/game/20200913_03.html">game stats</a></td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
Of course, there are more than three <TR> entries, but I'll list just three rows for the example.
Each of the <TR> rows also has an href to another page that holds the game stat summary for that game. A sample game stat summary page looks something like this (it's not an exact copy, but close enough for this example):
# A sample Game Stat summary page
<html>
<body>
<h1>Team A @ Team B</h1>
<div class="game stat">
<table type="game stat">
<tbody>
<tr row="1">
<td data="date">"9/13/2020"</td>
<td data="game_id">1</td>
<td data="game_time">2:00PM EST</td>
<td data="visit_team">Team A</td>
<td data="visit_team_score">43</td>
<td data="home_team">Team B</td>
<td data="home_team_score">53</td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
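As a sanity check, the XPath style used below can be tested against the sample markup directly with lxml; here's a quick standalone snippet, assuming the game stat sample above is saved locally as game_stat_summary.html (a hypothetical filename):
from lxml import html

# Parse the sample page and try the same style of XPath the spider uses
tree = html.fromstring(open("game_stat_summary.html").read())
print(tree.xpath("//td[@data = 'game_id']/text()"))     # ['1']
print(tree.xpath("//td[@data = 'visit_team']/text()"))  # ['Team A']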
My spider first scrapes the Game Schedule page, parses each of the <TR> entries using XPath, then stores them in a dict.
import os
import sys
from urllib.parse import urlparse  # Python 3 home of the old urlparse module
from lxml import etree, html
import scrapy
from scrapy.http import Request
from scrapy.loader import ItemLoader
class TestSpider(scrapy.Spider):
name = "test_spider"
season_flag = False
season_val = ""
"""
I need to override the __init__() method of scrapy.Spider
because I need to define some attributes/variables
"""
def __init__(self, *a, **kw):
super(TestSpider, self).__init__(*a, **kw)
self.season_flag = False
self.debug_flag = False
self.season_val = ""
self.game_list = list()
self.game_dict = dict()
if hasattr(self, "season"):
self.season_val = str(self.season)
self.season_flag = True
else:
self.log("No season argument. Exiting")
sys.exit(1)
        if hasattr(self, "debug"):
            # Spider arguments passed with -a arrive as strings,
            # so compare against the string "True"
            if str(self.debug) == "True":
                self.debug_flag = True
"""
Start the request by starting the scraping at
page that has the game schedule in a table
"""
def start_requests(self):
url_list = [
"https://somewebsite.com/2019.GameSchedule.htm"
]
for url in url_list:
            yield Request(url=url,
                          callback=self.parse_schedule_summary_page)
The callback method parse_schedule_summary_page() parses the <TR>s from the Game Schedule page, including the URL to each game's game stat summary page.
For each row it yields a Request(), passing game_dict along via the meta keyword argument.
    def parse_schedule_summary_page(self, response):
        """
        Convert the response object to an lxml tree object.
        """
        decoded = response.body.decode('utf-8')
        html_tree = html.fromstring(decoded)
        # Extract all the <TR>s from the 'games' table into a list
        l_game_elem_list = html_tree.xpath("//table[@type = 'games']/tbody/tr")
        # Iterate through each of the <TR> elements
        for l_game_elem in l_game_elem_list:
            game_dict = dict()
            """
            Parse the week number, date, game id, and URL to the
            game stat page. XPath's string() collapses each result
            to a single string instead of a one-element list.
            """
            p_weeknum = l_game_elem.xpath("string(.//th[@data = 'week_number'])")
            p_date = l_game_elem.xpath("string(.//td[@data = 'date'])")
            p_game_id = l_game_elem.xpath("string(.//td[@data = 'game_id'])")
            summary_url = l_game_elem.xpath("string(.//a[string() = 'game stats']/@href)")
            game_dict['week_num'] = p_weeknum
            game_dict['date'] = p_date
            game_dict['game_id'] = p_game_id
            # This is where the code gets wonky. The href is relative,
            # so build an absolute URL before yielding the Request.
            yield Request(response.urljoin(summary_url), priority=5,
                          meta={'dict': game_dict},
                          callback=self.parse_game_page)
parse_schedule_summary_page() then makes a Request() call to each game stat summary page, with parse_game_page() as its callback:
    def parse_game_page(self, response):
        game_dict = response.meta.get('dict')
        """
        Convert the response object to an lxml tree object.
        """
        decoded = response.body.decode('utf-8')
        html_tree = html.fromstring(decoded)
        game_date = html_tree.xpath("string(//td[@data = 'date'])")
        game_id = html_tree.xpath("string(//td[@data = 'game_id'])")
        game_time = html_tree.xpath("string(//td[@data = 'game_time'])")
        v_team = html_tree.xpath("string(//td[@data = 'visit_team'])")
        v_team_score = html_tree.xpath("string(//td[@data = 'visit_team_score'])")
        h_team = html_tree.xpath("string(//td[@data = 'home_team'])")
        h_team_score = html_tree.xpath("string(//td[@data = 'home_team_score'])")
game_dict['game_time'] = game_time
game_dict['x_game_id'] = game_id
# I copy the rest of the values I parsed via XPath into
# the game_dict dictionary. I won't repeat the code here
# for brevity's sake.
# Here's where I print it out, for debugging purposes
stmt = "==**==**==**==\n"
stmt += str(game_dict)
stmt += "\n**==**==**==**"
self.log(stmt)
From the output of the self.log(stmt) statement, I've noticed that the game_id entry and the x_game_id entry are not the same when they should be:
==**==**==**==
{'v_team': 'Team A', 'game_time': '6:30PM','game_id': '1', 'h_team_score': '53',
'h_team': 'Team B', 'week_num': '1', 'date': '9/13/2020', 'x_game_id': '7', 'v_team_score': '43'}
**==**==**==**
The game_id from parse_schedule_summary_page() does not match the x_game_id from parse_game_page(). It's like this for most, but not all, of the games.
According to the question linked above, this happens because Scrapy cannot guarantee the order in which URLs are visited.
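As I understand the Scrapy docs, crawl order shouldn't actually matter as long as each Request carries its own copy of the row data; since Scrapy 1.7, per-request state can also be passed with cb_kwargs instead of meta. Here's a sketch of that variant (just the two callbacks, not my original code):
def parse_schedule_summary_page(self, response):
    for row in response.xpath("//table[@type = 'games']/tbody/tr"):
        game_dict = {
            'week_num': row.xpath(".//th[@data = 'week_number']/text()").get(),
            'date': row.xpath(".//td[@data = 'date']/text()").get(),
            'game_id': row.xpath(".//td[@data = 'game_id']/text()").get(),
        }
        href = row.xpath(".//a[text() = 'game stats']/@href").get()
        yield Request(response.urljoin(href),
                      callback=self.parse_game_page,
                      cb_kwargs={'game_dict': game_dict})

def parse_game_page(self, response, game_dict):
    # game_dict arrives as a keyword argument, tied to this exact response
    game_dict['x_game_id'] = response.xpath("//td[@data = 'game_id']/text()").get()
    yield game_dict
Even so, I wanted to rule the ordering out.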
Following the recommendations in that question, I first changed this configuration in my settings.py file:
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1
It didn't help; the game data was still out of sync when I ran with this setting.
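For completeness, the linked question also suggests forcing breadth-first (FIFO) crawl order through the scheduler queue settings; per the Scrapy FAQ, that would look like this in settings.py:
# Switch the scheduler's LIFO queues to FIFO so requests are
# dispatched in the order they were enqueued (breadth-first)
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'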
I tried setting the priority of the Request() in parse_schedule_summary_page() but that didn't fix the problem either.
So I tried another recommendation and changed this code:
yield Request(response.urljoin(summary_url), priority=5,
              meta={'dict': game_dict}, callback=self.parse_game_page)
to this:
return [Request(response.urljoin(summary_url), priority=5,
                meta={'dict': game_dict}, callback=self.parse_game_page)]
By using return instead of yield, the game info from the <TR> in the Game Schedule page stays in sync with the data from the Game Stat summary page. However, the spider ends after processing only one <TR>, because return exits parse_schedule_summary_page() on the first iteration of the for loop.
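For what it's worth, collecting the Requests in a list and returning it once, after the loop, would at least schedule every row; a sketch along those lines (though I suspect it reintroduces the ordering problem, since all the Requests go back through the scheduler together):
def parse_schedule_summary_page(self, response):
    decoded = response.body.decode('utf-8')
    html_tree = html.fromstring(decoded)
    requests = []
    for l_game_elem in html_tree.xpath("//table[@type = 'games']/tbody/tr"):
        game_dict = {
            'week_num': l_game_elem.xpath("string(.//th[@data = 'week_number'])"),
            'date': l_game_elem.xpath("string(.//td[@data = 'date'])"),
            'game_id': l_game_elem.xpath("string(.//td[@data = 'game_id'])"),
        }
        summary_url = l_game_elem.xpath("string(.//a[string() = 'game stats']/@href)")
        requests.append(Request(response.urljoin(summary_url), priority=5,
                                meta={'dict': game_dict},
                                callback=self.parse_game_page))
    return requests  # a single return, after the loop, schedules every row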
How can I get the Request() call in parse_schedule_summary_page() to call each of the hrefs in the <TR> entries and scrape each of the game stat summary pages rather than just stopping after processing one <TR>?
Upvotes: 1
Views: 544
Reputation: 2469
Another method, using the simplified_scrapy framework:
from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain, utils

class MySpider(Spider):
    name = 'my_spider'
    start_urls = ['https://somewebsite.com/2019.GameSchedule.htm']
    # refresh_urls = True  # Uncomment to re-download already-downloaded links

    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        lstA = []
        game_dict = {}
        if url.url in self.start_urls:
            # Schedule page: get all rows
            trs = doc.getElement('table', attr='type', value='games').trs
            for tr in trs:
                cols = tr.children
                # Splice the full URL path; use a new name so the
                # 'url' parameter isn't clobbered inside the loop
                row = {'url': utils.absoluteUrl(url.url, tr.a.href)}
                for col in cols:
                    row[col['data']] = col.text  # Pass data to the next page
                lstA.append(row)
        else:
            # Game stat page: read the single row of stats
            cols = doc.getElement('table', attr='type', value='game stat').tr.tds
            for col in cols:
                game_dict[col['data']] = col.text
            # Use the data carried over from the previous page
            game_dict['week_number'] = url['week_number']
            game_dict['game_summary'] = url['game_summary']
        # Return the data to the framework, which saves it for you
        return {'Urls': lstA, 'Data': game_dict}

SimplifiedMain.startThread(MySpider())  # Start the download
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
Upvotes: 0