Reputation: 1222
New to Scrapy and trying to scrape some simple HTML tables. I've found a site with the same schema for two different tables on the same page; however, the scrape works for one of the tables but not the other. Here's the link: https://fbref.com/en/comps/12/stats/La-Liga-Stats
My code that works (the first table, the one at the top):
import scrapy

class PostSpider(scrapy.Spider):
    name = 'stats'
    start_urls = [
        'https://fbref.com/en/comps/12/stats/La-Liga-Stats',
    ]

    def parse(self, response):
        for row in response.xpath('//*[@id="stats_standard_squads"]//tbody/tr'):
            yield {
                'players': row.xpath('td[2]//text()').extract_first(),
                'possession': row.xpath('td[3]//text()').extract_first(),
                'played': row.xpath('td[4]//text()').extract_first(),
                'starts': row.xpath('td[5]//text()').extract_first(),
                'minutes': row.xpath('td[6]//text()').extract_first(),
                'goals': row.xpath('td[7]//text()').extract_first(),
                'assists': row.xpath('td[8]//text()').extract_first(),
                'penalties': row.xpath('td[9]//text()').extract_first(),
            }
Now for some reason, when I try to scrape the table below it (using the relevant XPath selector), it returns nothing:
import scrapy

class PostSpider(scrapy.Spider):
    name = 'stats'
    start_urls = [
        'https://fbref.com/en/comps/12/stats/La-Liga-Stats',
    ]

    def parse(self, response):
        for row in response.xpath('//*[@id="stats_standard"]//tbody/tr'):
            yield {
                'player': row.xpath('td[2]//text()').extract_first(),
                'nation': row.xpath('td[3]//text()').extract_first(),
                'pos': row.xpath('td[4]//text()').extract_first(),
                'squad': row.xpath('td[5]//text()').extract_first(),
                'age': row.xpath('td[6]//text()').extract_first(),
                'born': row.xpath('td[7]//text()').extract_first(),
                '90s': row.xpath('td[8]//text()').extract_first(),
                'att': row.xpath('td[9]//text()').extract_first(),
            }
Here are the logs from the terminal when I execute scrapy crawl stats:
2020-07-23 17:35:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://fbref.com/robots.txt> (referer: None)
2020-07-23 17:35:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://fbref.com/en/comps/12/stats/La-Liga-Stats> (referer: None)
2020-07-23 17:35:34 [scrapy.core.engine] INFO: Closing spider (finished)
What's the reason this is happening? The tables have an identical structure as far as I can see.
Upvotes: 0
Views: 212
Reputation: 805
The problem is that id="stats_standard" is not present in the live HTML of the page (see view-source:https://fbref.com/en/comps/12/stats/La-Liga-Stats). The table is only there as commented-out code, so the XPath matches nothing.
Try response.css('.placeholder ::text').getall(). You then need to parse the commented HTML yourself, either with a regex or with Scrapy's Selector:
from scrapy import Selector
Selector(text=your_raw_html)  # your_raw_html: the HTML string taken out of the comment
Upvotes: 1