How to scrape a JavaScript table in R?

Question

I want to scrape a table from the Citi Bike trip data page: https://s3.amazonaws.com/tripdata/index.html

My goal is to get the URLs of all the zip files at once, instead of manually typing each date and downloading the files one at a time. Since the webpage is updated monthly, I want to be able to get all the up-to-date data files every time I run the function.

I first tried the rvest and XML packages, and then realized that the webpage contains both static HTML and a table that is generated by a JavaScript function. That is where the problem lies.

I would really appreciate any help. Please let me know if I can provide further information.

deamentiaemundi · Accepted Answer

If I go to https://s3.amazonaws.com/tripdata/ (just the root, no index.html), I get a simple XML file. If you want to parse the XML, the relevant element is Key (uppercase K, lowercase e and y). I would just search the plain text instead: ignore the XML structure, treat the response as a simple text file, take every string between <Key> and </Key> as a filename, and prefix it with https://s3.amazonaws.com/tripdata/ to get the download URL.

The first entry seems to be everything bundled together (170 MB), so you might be fine with that file alone.
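
The plain-text approach described above could be sketched in base R roughly as follows. This is only a minimal sketch under a few assumptions: the helper names extract_keys and zip_urls are made up for illustration, the listing is assumed to fit in a single response, and the keys are assumed not to contain characters that need URL escaping.

```r
# Sketch: pull every <Key>...</Key> value out of the bucket listing
# and turn it into a download URL. Base R only; helper names are illustrative.

base_url <- "https://s3.amazonaws.com/tripdata/"

# Return the text between <Key> and </Key> for every match in the listing,
# where the listing is given as a single character string.
extract_keys <- function(listing) {
  matches <- regmatches(listing, gregexpr("<Key>[^<]*</Key>", listing))[[1]]
  gsub("</?Key>", "", matches)
}

# Keep only the keys ending in .zip and prefix them with the bucket URL.
zip_urls <- function(listing, base = base_url) {
  keys <- extract_keys(listing)
  zips <- keys[grepl("\\.zip$", keys)]
  if (length(zips) == 0) {
    return(character(0))  # paste0() would otherwise return the bare base URL
  }
  paste0(base, zips)
}

# Usage (needs network access, so it is left commented out):
# listing <- paste(readLines(base_url, warn = FALSE), collapse = "")
# urls <- zip_urls(listing)
# for (u in urls) download.file(u, destfile = basename(u), mode = "wb")
```

Because the listing is fetched fresh on every run, newly added monthly files should show up without any change to the code. One caveat: an S3 bucket listing returns at most 1000 keys per request, so a larger bucket would need paging, which this sketch does not handle.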
