I want to extract some data from a website I subscribe to. I can download the csv files manually, but there is a file for each day and I want 5 years' worth of data, so doing it by hand would take too long.
I have used rvest to log into the site, but to download the data I need to click a button manually. How can I do this within R?
When I do it manually the file goes into my Downloads folder, which is totally fine since I can do a mass import, but if I can load the data straight into R that would be a little easier.
Here is what I have so far:
library(rvest)

url <- "http://www.horseracebase.com/index.php/"

pgsession <- html_session(url)          ## create session
pgform    <- html_form(pgsession)[[1]]  ## pull form from session

filled_form <- set_values(pgform,
                          `login`    = "xxx",
                          `password` = "yyy")

submit_form(pgsession, filled_form)
This gets me logged in (I think), but now I don't know how to extract the data.
I do the same thing on Betfair where I use something like:
df <- read.csv("http://www.someurl.com/betfairdata.csv")
This works fine but all their files are listed on the actual page so no clicking of buttons required.
Is there any way to interact with the button using rvest, or is there a way of finding the correct URL so I can just use read.csv as above?
Thanks
I created a free account and examined the website.
It looks like you're conflating .csv files with HTML tables. Nowhere on the site, as far as I can find, are there any .csv files. When you say "there's no clicking" and that they "display the .csv files", what you're actually describing is an HTML table.
On a side note, there are also no .csv files in the website's backend. The backend is a relational database, which powers their many filters and search features.
Having said all that, there are a large number of resources on how to scrape XML and HTML tables using rvest and other R packages, both in the documentation, in SO answers, and in various blogs found via Google (if not on Stack Overflow Documentation -- I haven't checked, but it's probably there too). I will quote one from Stats and Things, but note that html() has been deprecated in favor of read_html(), though both still work:
library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
population <- url %>%
html() %>%
html_nodes(xpath='//*[@id="mw-content-text"]/table[1]') %>%
html_table()
population <- population[[1]]
head(population)
You can use the built-in Developer Tools in Chrome or Firefox to identify the part of the page that you need to extract. You can also use third-party tools like Firebug, but that's not really necessary.
For example:
/html/body/table[2]/tbody/tr/td/table[2]
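To tie that back to your original question, here is a minimal sketch of how the same approach could be applied inside the logged-in rvest session from your code. The results URL and the XPath below are placeholders (you would find the real ones with Developer Tools), and it uses the same pre-1.0 rvest functions (jump_to(), html_nodes()) as your snippet:
library(rvest)

## navigate within the authenticated session to the page that holds the table
## (results_url is a placeholder -- find the real address via Developer Tools)
results_url  <- "http://www.horseracebase.com/results.php"
results_page <- jump_to(pgsession, results_url)

## pull the table out by XPath (the path below is illustrative, not the site's real one)
results <- results_page %>%
  html_nodes(xpath = "/html/body/table[2]") %>%
  html_table(fill = TRUE)

results <- results[[1]]
head(results)
Staying inside pgsession matters here: the session carries your login cookies, whereas a plain read_html() call on the URL would be treated as a logged-out request.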
Of course, if you want to save your parsed results into a .csv (or any other kind of) file, you can do that after you've got it into a data.frame, data.table, or other flat object:
write.csv(population, "population.csv", row.names = FALSE)
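If you're already working with data.table, its fwrite() is a quick alternative for writing out the same object (a sketch, assuming the population data.frame from above):
library(data.table)

## fwrite() accepts a plain data.frame, writes large tables quickly, and skips row names by default
fwrite(population, "population.csv")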
Note that some people find it easier to scrape tables with readHTMLTable() from the XML package, though both can do the job.
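For completeness, a minimal sketch of that XML-package route on the same Wikipedia page; RCurl::getURL() is used here to fetch the raw HTML first, since readHTMLTable() can struggle when handed an https URL directly:
library(XML)
library(RCurl)

url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"

## fetch the raw HTML, then let readHTMLTable() parse every <table> on the page
raw_html <- getURL(url)
tables   <- readHTMLTable(raw_html, stringsAsFactors = FALSE)

population <- tables[[1]]
head(population)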