clattenburg cake

Reputation: 1222

Python Scrapy Returning 200 But Closes Spider With Nothing

New to Scrapy and trying to scrape some simple HTML tables. I've found a site that uses the same schema for two different tables on the same page, but the scrape works for one table and not the other. Here's the link: https://fbref.com/en/comps/12/stats/La-Liga-Stats

My code that works (the first table, the one at the top):

import scrapy


class PostSpider(scrapy.Spider):

    name = 'stats'

    start_urls = [
        'https://fbref.com/en/comps/12/stats/La-Liga-Stats',
    ]

    def parse(self, response):
        for row in response.xpath('//*[@id="stats_standard_squads"]//tbody/tr'):
            yield {
                'players': row.xpath('td[2]//text()').extract_first(),
                'possession': row.xpath('td[3]//text()').extract_first(),
                'played': row.xpath('td[4]//text()').extract_first(),
                'starts': row.xpath('td[5]//text()').extract_first(),
                'minutes': row.xpath('td[6]//text()').extract_first(),
                'goals': row.xpath('td[7]//text()').extract_first(),
                'assists': row.xpath('td[8]//text()').extract_first(),
                'penalties': row.xpath('td[9]//text()').extract_first(),
            }

Now for some reason, when I try to scrape the table below it (using the relevant XPath selector), it returns nothing:

import scrapy


class PostSpider(scrapy.Spider):

    name = 'stats'

    start_urls = [
        'https://fbref.com/en/comps/12/stats/La-Liga-Stats',
    ]

    def parse(self, response):
        for row in response.xpath('//*[@id="stats_standard"]//tbody/tr'):
            yield {
                'player': row.xpath('td[2]//text()').extract_first(),
                'nation': row.xpath('td[3]//text()').extract_first(),
                'pos': row.xpath('td[4]//text()').extract_first(),
                'squad': row.xpath('td[5]//text()').extract_first(),
                'age': row.xpath('td[6]//text()').extract_first(),
                'born': row.xpath('td[7]//text()').extract_first(),
                '90s': row.xpath('td[8]//text()').extract_first(),
                'att': row.xpath('td[9]//text()').extract_first(),
            }

Here are the logs from the terminal when I execute scrapy crawl stats:

2020-07-23 17:35:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://fbref.com/robots.txt> (referer: None)
2020-07-23 17:35:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://fbref.com/en/comps/12/stats/La-Liga-Stats> (referer: None)
2020-07-23 17:35:34 [scrapy.core.engine] INFO: Closing spider (finished)

What's the reason this is happening? The tables have an identical structure as far as I can see.

Upvotes: 0

Views: 212

Answers (1)

Ikram Khan Niazi

Reputation: 805

The problem is that id="stats_standard" does not exist as a live element in the page source. Have a look at view-source:https://fbref.com/en/comps/12/stats/La-Liga-Stats: the table is in the HTML, but only inside a comment, so your XPath has nothing to match.
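One way to confirm this from a script, as a rough sketch: strip the HTML comments from the page body and check whether the id survives. The helper name id_only_in_comments and the sample markup below are made up for illustration; in a spider you would pass response.text instead of the sample.

```python
import re

# Matches HTML comments, including ones that span several lines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def id_only_in_comments(html, element_id):
    """Return True if id="element_id" occurs in html, but only inside comments."""
    needle = 'id="{}"'.format(element_id)
    without_comments = COMMENT_RE.sub("", html)
    return needle in html and needle not in without_comments


# Made-up sample imitating the situation: one live table, one commented out.
sample = """
<table id="stats_standard_squads"><tbody><tr><td>a</td></tr></tbody></table>
<!--
<table id="stats_standard"><tbody><tr><td>b</td></tr></tbody></table>
-->
"""

print(id_only_in_comments(sample, "stats_standard_squads"))  # False
print(id_only_in_comments(sample, "stats_standard"))         # True
```

If the second call returns True for the real page body, the table is only present as commented markup and has to be pulled out of the comment before it can be queried.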

Try response.css('.placeholder ::text').getall(). You then need to pull the table markup out of the comment, either with a regex or by passing the raw HTML to Scrapy's Selector class:

from scrapy import Selector

# your_raw_html: the table markup recovered from the HTML comment
selector = Selector(text=your_raw_html)
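For the regex route, here is a minimal sketch of how the raw HTML could be recovered, assuming the table sits in an ordinary <!-- ... --> comment. The function name commented_html_with_id and the sample string are hypothetical; only the standard library is used, so it runs without Scrapy.

```python
import re

# Captures the body of each HTML comment, including multi-line ones.
COMMENT_BODY_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def commented_html_with_id(html, element_id):
    """Return the first comment body containing id="element_id", or None."""
    needle = 'id="{}"'.format(element_id)
    for body in COMMENT_BODY_RE.findall(html):
        if needle in body:
            return body
    return None


# Made-up sample; in the spider, pass response.text instead.
sample = (
    '<table id="stats_standard_squads"><tbody></tbody></table>\n'
    '<!--\n'
    '<table id="stats_standard"><tbody>'
    '<tr><th>1</th><td>Some Player</td></tr>'
    '</tbody></table>\n'
    '-->'
)

raw_html = commented_html_with_id(sample, "stats_standard")
print(raw_html.strip())
```

The returned string is what goes into Selector(text=...). The XPath from the question, //*[@id="stats_standard"]//tbody/tr, can then be run against that selector instead of against response.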

Upvotes: 1
