Cole Barrett
Cole Barrett

Reputation: 11

Confused about XPath Syntax

  1. Problem Summary:

Hi, I'm trying to learn to use the Scrapy Framework for python (available at https://scrapy.org). I'm following along with a tutorial I found here: https://www.scrapehero.com/scrape-alibaba-using-scrapy/, but I was going to use a different site for practice rather than just copy them on Alibaba. My goal is to get game data from https://www.mlb.com/scores.

So I need to use Xpath to tell the spider which parts of the html to scrape, (I'm about halfway down on that tutorial page on the scrapehero site, at the "Construct Xpath selectors for the product list" section). Problem is I'm having a hell of a time figuring out what syntax should actually be to get the pieces I want? I've been going over xpath examples all morning trying to figure out the right syntax but I haven't been able to get it.

  1. Background info:

So what I want is- from https://www.mlb.com/scores, I want an xpath() command which will return an array with all the games displayed.

Following along with the tutorial, what I understand about how to do this is I'd want to inspect the elements from the webpage, determine their class/id, and specific that in the xpath command.

I've tried a lot of variations to get the data but all are returning empty arrays.

I don't really have any training in XPath so I'm not sure if my syntax is just off somewhere or what, but I'd really appreciate any help on getting this command to return the objects I'm looking for. Thanks for taking the time to read this.

  1. Code:

Here are some of the attempts that didn't work:

response.xpath("//div[@class='g5-component--mlb-scores__game-wrapper']")
response.xpath("//div[@class='g5-component]")
response.xpath("//li[@class='mlb-scores__list-item mlb-scores__list-item--game']")
response.xpath("//li[@class='mlb-scores__list-item']")
response.xpath("//div[@!data-game-pk-id > 0]")'
response.xpath("//div[contains(@class, 'g5-component')]")
  1. Expected Results and Actual Results

I want an XPath command that returns an array containing a selector object for each game on the mlb.com/scores page.

So far I've been able to get generic returns that aren't actually what I want (I can get a selector that returns the whole page by just leaving out the predicates, but whenever I try to specify I end up with an empty array).

So for all my attempts I either get the wrong objects or an empty array.

Upvotes: 1

Views: 75

Answers (1)

gangabass
gangabass

Reputation: 10666

You need to always check HTML source code (Ctrl+U in a browser) for the data you need. For MLB page you'll find that content you are want to parse is loaded dynamically using JavaScript.

You can try to use Scrapy-Splash to get target content from your start_urls or you can find direct HTTP request used to get information you want (using Network tab of Chrome Developer Tools) and parse JSON:

https://statsapi.mlb.com/api/v1/schedule?sportId=1,51&date=2019-06-26&gameTypes=E,S,R,A,F,D,L,W&hydrate=team(leaders(showOnPreview(leaderCategories=[homeRuns,runsBattedIn,battingAverage],statGroup=[pitching,hitting]))),linescore(matchup,runners),flags,liveLookin,review,broadcasts(all),decisions,person,probablePitcher,stats,homeRuns,previousPlay,game(content(media(featured,epg),summary),tickets),seriesStatus(useOverride=true)&useLatestGames=false&language=en&leagueId=103,104,420

Upvotes: 1

Related Questions