Reputation: 1976
I am trying to scrape data from a URL. The data is not in HTML tables, so pandas.read_html() is not picking it up.
The URL is: https://www.athlinks.com/event/1015/results/Event/638761/Course/988506/Results
The data I'd like to get is a table of gender, age, and time for the past 5k races (name is not really important). The web page presents the data 50 rows at a time, across around 25 pages.
The site uses various JavaScript frameworks for the UI (node.js, react). I found this out using the "What Runs" add-on in the Chrome browser.
Here's the real reason I'd like to get this data. I'm a new runner and will be participating in this 5k next weekend, and I would like to explore some of the distribution statistics for past races (it's an annual race, and the data goes back to the 1980s).
Thanks in advance!
Upvotes: 1
Views: 311
Reputation: 396
You need to render the JS in a browser engine before crawling the generated HTML. Have you tried https://github.com/scrapinghub/splash, https://github.com/miyakogi/pyppeteer, or https://www.npmjs.com/package/spa-crawler? You can also inspect the page (F12 -> Network) while it is loading the data relevant to you (from a RESTful API, I suppose), and then make the same calls from the command line using curl or the requests Python library.
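If the Network panel does show a plain HTTP endpoint, replaying it could look roughly like the sketch below. The URL and the "from"/"limit" parameter names are placeholders I made up, not the site's real API, and the code assumes each response is a JSON list of rows; substitute whatever the Network panel actually shows.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

# Hypothetical endpoint: replace with the request URL seen in the Network panel.
API_URL = "https://example.invalid/results"


def fetch_page_http(offset: int, limit: int) -> List[Row]:
    """Fetch one page of rows over HTTP (parameter names are assumptions)."""
    import requests  # third-party: pip install requests

    resp = requests.get(
        API_URL, params={"from": offset, "limit": limit}, timeout=30
    )
    resp.raise_for_status()
    return resp.json()  # assumes the body is a JSON list of row objects


def fetch_all(
    fetch_page: Callable[[int, int], List[Row]],
    page_size: int = 50,
    max_pages: int = 100,
) -> List[Row]:
    """Request pages in order until an empty or short page comes back."""
    rows: List[Row] = []
    for page in range(max_pages):
        batch = fetch_page(page * page_size, page_size)
        if not batch:
            break
        rows.extend(batch)
        if len(batch) < page_size:
            break  # a short page means this was the last one
    return rows
```

Calling fetch_all(fetch_page_http) would then walk the roughly 25 pages of 50 rows, and the resulting list can be passed to pandas.DataFrame for the statistics.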
Upvotes: 0
Reputation: 695
The data comes from socket.io, and there are Python packages for it. How did I find it?
If you open the Network panel in your browser and choose the XHR filter, you'll find something like https://results-hub.athlinks.com/socket.io/?EIO=3&transport=polling&t=MYOPtCN&sid=5C1HrIXd0GRFLf0KAZZi
Look at the response content: it is what we need.
Luckily, this site has source maps. Go to More tools -> Search and search for this domain. Then find resultsHubUrl in settings, and from there setUpSocket. setUpSocket is used inside IndividualResultsStream.js and RaseStreams.js. Now you can press CMD + P and dig into these files.
So... it took me around five minutes to find this. You can go ahead! Now you have all the necessary tools. Feel free to use breakpoints and read more about Chrome developer tools.
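To give an idea of what the polling responses contain, here is a minimal sketch of decoding one by hand. It assumes the text framing used by Engine.IO v3 (the EIO=3 in the URL), where each packet is written as <length>:<packet>, and that event packets start with "42" followed by a JSON array of [event name, arguments...]. The event name "results" in the usage example is invented, and the real responses may use binary framing or a namespace prefix, which this sketch does not handle.

```python
import json
from typing import List, Tuple


def split_payload(payload: str) -> List[str]:
    """Split a '<length>:<packet>' framed polling payload into packets."""
    packets = []
    i = 0
    while i < len(payload):
        colon = payload.index(":", i)
        length = int(payload[i:colon])
        start = colon + 1
        # Lengths count characters; this matches for plain ASCII payloads.
        packets.append(payload[start:start + length])
        i = start + length
    return packets


def decode_events(payload: str) -> List[Tuple[str, list]]:
    """Return (event_name, args) for every Socket.IO event in the payload."""
    events = []
    for packet in split_payload(payload):
        # "4" = Engine.IO message, "2" = Socket.IO event, default namespace.
        if packet.startswith("42["):
            name, *args = json.loads(packet[2:])
            events.append((name, args))
    return events


# Usage with a made-up payload: one event packet plus a "40" connect packet.
packet = '42["results",{"age":30,"gender":"F","time":"25:10"}]'
payload = f"{len(packet)}:{packet}2:40"
events = decode_events(payload)
```

A maintained client such as the python-socketio package handles the handshake and framing for you, so it is probably the better choice for real scraping; I have not checked which of its versions still speaks EIO=3.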
Upvotes: 1