Reputation: 372
I'm trying to get a Scrapy spider to crawl a website's pages in the order my spider code makes the Request() calls.
It's similar to this question: Scrapy Crawl URLs in Order
I examined the answers to that question and tried them, but none of them work quite the way I need.
My problem is that I need to scrape a table on a page. Each table row holds several values, and one of them is an href to another page. The first callback method scrapes the table, then makes a subsequent Request() call using that href. The first page spawns requests to many other pages. I pass the data from the first callback method to the second callback method in a dict, using the Request's meta keyword.
The second callback method scrapes the contents of that page and adds the parsed data to the dict it was passed. But the data from the first callback isn't always for the same game as the data from the second callback.
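In outline, the pattern I'm describing looks like this (a minimal sketch with illustrative names and a placeholder URL, not my actual code):
import scrapy

class PatternSketch(scrapy.Spider):
    name = "pattern_sketch"
    start_urls = ["https://somewebsite.com/schedule.htm"]  # placeholder

    def parse(self, response):
        # First callback: scrape each table row, then follow its href
        for row in response.xpath("//table/tbody/tr"):
            row_data = {"game_id": row.xpath(".//td[@data = 'game_id']/text()").get()}
            href = row.xpath(".//a/@href").get()
            if href:
                yield scrapy.Request(response.urljoin(href),
                                     meta={"dict": row_data},
                                     callback=self.parse_detail)

    def parse_detail(self, response):
        # Second callback: read the dict back and add this page's data
        row_data = response.meta["dict"]
        row_data["game_time"] = response.xpath("//td[@data = 'game_time']/text()").get()
        yield row_data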
The HTML document of the first page looks like this:
# Game Schedule page
<html>
<body>
<div>
<table type="games">
<tbody>
<tr row="1">
<th data="week_number">1</th>
<td data="date">"9/13/2020"</td>
<td data="game_id">1</td>
<td data="game_summary"><a href="/game/20200913_01.html">game stats</a></td>
</tr>
<tr row="2">
<th data="week_number">1</th>
<td data="date">"9/13/2020"</td>
<td data="game_id">2</td>
<td data="game_summary"><a href="/game/20200913_02.html">game stats</a></td>
</tr>
<tr row="3">
<th data="week_number">1</th>
<td data="date">"9/13/2020"</td>
<td data="game_id">3</td>
<td data="game_summary"><a href="/game/20200913_03.html">game stats</a></td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
Of course, there are more than three <TR> entries, but I'll list just three rows for the example.
Each of the <TR> rows also has an href to another page that holds the game stat summary for that game. A sample game stat summary page looks something like this (it's not an exact copy, but close enough for this example):
# A sample Game Stat summary page
<html>
<body>
<h1>Team A @ Team B</h1>
<div class="game stat">
<table type="game stat">
<tbody>
<tr row="1">
<td data="date">"9/13/2020"</td>
<td data="game_id">1</td>
<td data="game_time">2:00PM EST</td>
<td data="visit_team">Team A</td>
<td data="visit_team_score">43</td>
<td data="home_team">Team B</td>
<td data="home_team_score">53</td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
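As a sanity check, the XPath style used below can be tested against the sample markup directly with lxml; here's a quick standalone snippet, assuming the game stat sample above is saved locally as game_stat_summary.html (a hypothetical filename):
from lxml import html

# Parse the sample page and try the same style of XPath the spider uses
tree = html.fromstring(open("game_stat_summary.html").read())
print(tree.xpath("//td[@data = 'game_id']/text()"))     # ['1']
print(tree.xpath("//td[@data = 'visit_team']/text()"))  # ['Team A']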
My spider first scrapes the Game Schedule page, parses each of the <TR> entries using XPath, then stores them in a dict.
import os
import sys
from urllib.parse import urlparse  # Python 3 home of the old urlparse module
from lxml import etree, html
import scrapy
from scrapy.http import Request
from scrapy.loader import ItemLoader
class TestSpider(scrapy.Spider):
name = "test_spider"
season_flag = False
season_val = ""
"""
I need to override the __init__() method of scrapy.Spider
because I need to define some attributes/variables
"""
def __init__(self, *a, **kw):
super(TestSpider, self).__init__(*a, **kw)
self.season_flag = False
self.debug_flag = False
self.season_val = ""
self.game_list = list()
self.game_dict = dict()
if hasattr(self, "season"):
self.season_val = str(self.season)
self.season_flag = True
else:
self.log("No season argument. Exiting")
sys.exit(1)
        if hasattr(self, "debug"):
            # Spider arguments passed with -a arrive as strings,
            # so compare against the string "True"
            if str(self.debug) == "True":
                self.debug_flag = True
"""
Start the request by starting the scraping at
page that has the game schedule in a table
"""
def start_requests(self):
url_list = [
"https://somewebsite.com/2019.GameSchedule.htm"
]
for url in url_list:
            yield Request(url=url,
                          callback=self.parse_schedule_summary_page)
The callback method parse_schedule_summary_page() parses the <TR>s from the Game Schedule page, including the URL to each game's game stat summary page.
For each row it yields a Request(), passing game_dict along via the meta keyword argument.
    def parse_schedule_summary_page(self, response):
        """
        Convert the response object to an lxml tree object.
        """
        decoded = response.body.decode('utf-8')
        html_tree = html.fromstring(decoded)
        # Extract all the <TR>s from the 'games' table into a list
        l_game_elem_list = html_tree.xpath("//table[@type = 'games']/tbody/tr")
        # Iterate through each of the <TR> elements
        for l_game_elem in l_game_elem_list:
            game_dict = dict()
            """
            Parse the week number, date, game id, and URL to the
            game stat page. XPath's string() collapses each result
            to a single string instead of a one-element list.
            """
            p_weeknum = l_game_elem.xpath("string(.//th[@data = 'week_number'])")
            p_date = l_game_elem.xpath("string(.//td[@data = 'date'])")
            p_game_id = l_game_elem.xpath("string(.//td[@data = 'game_id'])")
            summary_url = l_game_elem.xpath("string(.//a[string() = 'game stats']/@href)")
            game_dict['week_num'] = p_weeknum
            game_dict['date'] = p_date
            game_dict['game_id'] = p_game_id
            # This is where the code gets wonky. The href is relative,
            # so build an absolute URL before yielding the Request.
            yield Request(response.urljoin(summary_url), priority=5,
                          meta={'dict': game_dict},
                          callback=self.parse_game_page)
parse_schedule_summary_page() then makes a Request() call to each game stat summary page, with parse_game_page() as its callback:
    def parse_game_page(self, response):
        game_dict = response.meta.get('dict')
        """
        Convert the response object to an lxml tree object.
        """
        decoded = response.body.decode('utf-8')
        html_tree = html.fromstring(decoded)
        game_date = html_tree.xpath("string(//td[@data = 'date'])")
        game_id = html_tree.xpath("string(//td[@data = 'game_id'])")
        game_time = html_tree.xpath("string(//td[@data = 'game_time'])")
        v_team = html_tree.xpath("string(//td[@data = 'visit_team'])")
        v_team_score = html_tree.xpath("string(//td[@data = 'visit_team_score'])")
        h_team = html_tree.xpath("string(//td[@data = 'home_team'])")
        h_team_score = html_tree.xpath("string(//td[@data = 'home_team_score'])")
game_dict['game_time'] = game_time
game_dict['x_game_id'] = game_id
# I copy the rest of the values I parsed via XPath into
# the game_dict dictionary. I won't repeat the code here
# for brevity's sake.
# Here's where I print it out, for debugging purposes
stmt = "==**==**==**==\n"
stmt += str(game_dict)
stmt += "\n**==**==**==**"
self.log(stmt)
From the output of the self.log(stmt) statement, I've noticed that the game_id entry and the x_game_id entry are not the same when they should be:
==**==**==**==
{'v_team': 'Team A', 'game_time': '6:30PM','game_id': '1', 'h_team_score': '53',
'h_team': 'Team B', 'week_num': '1', 'date': '9/13/2020', 'x_game_id': '7', 'v_team_score': '43'}
**==**==**==**
The game_id from parse_schedule_summary_page() does not match the x_game_id from parse_game_page(). It's like this for most, but not all, of the games.
According to the question linked above, this happens because Scrapy cannot guarantee the order in which URLs are visited.
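As I understand the Scrapy docs, crawl order shouldn't actually matter as long as each Request carries its own copy of the row data; since Scrapy 1.7, per-request state can also be passed with cb_kwargs instead of meta. Here's a sketch of that variant (just the two callbacks, not my original code):
def parse_schedule_summary_page(self, response):
    for row in response.xpath("//table[@type = 'games']/tbody/tr"):
        game_dict = {
            'week_num': row.xpath(".//th[@data = 'week_number']/text()").get(),
            'date': row.xpath(".//td[@data = 'date']/text()").get(),
            'game_id': row.xpath(".//td[@data = 'game_id']/text()").get(),
        }
        href = row.xpath(".//a[text() = 'game stats']/@href").get()
        yield Request(response.urljoin(href),
                      callback=self.parse_game_page,
                      cb_kwargs={'game_dict': game_dict})

def parse_game_page(self, response, game_dict):
    # game_dict arrives as a keyword argument, tied to this exact response
    game_dict['x_game_id'] = response.xpath("//td[@data = 'game_id']/text()").get()
    yield game_dict
Even so, I wanted to rule the ordering out.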
Following the recommendations in that question, I first changed this configuration in my settings.py file:
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1
It didn't help; the game data was still out of sync when I ran with this setting.
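For completeness, the linked question also suggests forcing breadth-first (FIFO) crawl order through the scheduler queue settings; per the Scrapy FAQ, that would look like this in settings.py:
# Switch the scheduler's LIFO queues to FIFO so requests are
# dispatched in the order they were enqueued (breadth-first)
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'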
I tried setting the priority of the Request() in parse_schedule_summary_page() but that didn't fix the problem either.
So I tried another recommendation and changed this code:
yield Request(response.urljoin(summary_url), priority=5,
              meta={'dict': game_dict}, callback=self.parse_game_page)
to this:
return [Request(response.urljoin(summary_url), priority=5,
                meta={'dict': game_dict}, callback=self.parse_game_page)]
By using return instead of yield, the game info from the <TR> in the Game Schedule page stays in sync with the data from the Game Stat summary page. However, the spider ends after processing only one <TR>, because return exits parse_schedule_summary_page() on the first iteration of the for loop.
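For what it's worth, collecting the Requests in a list and returning it once, after the loop, would at least schedule every row; a sketch along those lines (though I suspect it reintroduces the ordering problem, since all the Requests go back through the scheduler together):
def parse_schedule_summary_page(self, response):
    decoded = response.body.decode('utf-8')
    html_tree = html.fromstring(decoded)
    requests = []
    for l_game_elem in html_tree.xpath("//table[@type = 'games']/tbody/tr"):
        game_dict = {
            'week_num': l_game_elem.xpath("string(.//th[@data = 'week_number'])"),
            'date': l_game_elem.xpath("string(.//td[@data = 'date'])"),
            'game_id': l_game_elem.xpath("string(.//td[@data = 'game_id'])"),
        }
        summary_url = l_game_elem.xpath("string(.//a[string() = 'game stats']/@href)")
        requests.append(Request(response.urljoin(summary_url), priority=5,
                                meta={'dict': game_dict},
                                callback=self.parse_game_page))
    return requests  # a single return, after the loop, schedules every row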
How can I get the Request() call in parse_schedule_summary_page() to call each of the hrefs in the <TR> entries and scrape each of the game stat summary pages rather than just stopping after processing one <TR>?
Upvotes: 1
Views: 544
Reputation: 2469
Another method, using the simplified_scrapy framework:
from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain, utils

class MySpider(Spider):
    name = 'my_spider'
    start_urls = ['https://somewebsite.com/2019.GameSchedule.htm']
    # refresh_urls = True  # Uncomment to re-download already-downloaded links

    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        lstA = []
        game_dict = {}
        if url.url in self.start_urls:
            # Schedule page: get all rows
            trs = doc.getElement('table', attr='type', value='games').trs
            for tr in trs:
                cols = tr.children
                # Splice the full URL path; use a new name so the
                # 'url' parameter isn't clobbered inside the loop
                row = {'url': utils.absoluteUrl(url.url, tr.a.href)}
                for col in cols:
                    row[col['data']] = col.text  # Pass data to the next page
                lstA.append(row)
        else:
            # Game stat page: read the single row of stats
            cols = doc.getElement('table', attr='type', value='game stat').tr.tds
            for col in cols:
                game_dict[col['data']] = col.text
            # Use the data carried over from the previous page
            game_dict['week_number'] = url['week_number']
            game_dict['game_summary'] = url['game_summary']
        # Return the data to the framework, which saves it for you
        return {'Urls': lstA, 'Data': game_dict}

SimplifiedMain.startThread(MySpider())  # Start the download
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
Upvotes: 0