MidnightDataGeek

Reputation: 938

Extracting .csv files from a website

I want to extract some data from a website I subscribe to. I can extract the .csv files manually, but there is a file for each day and I want five years' worth of data, so it will take too long.

I have used rvest to log into the site, but to download the data I need to manually click a button. How can I do this within R?

When I do it manually, the file goes into my downloads folder, which is totally fine since I can do a mass import, but if I could just load the data straight into R, that would be a little easier.

Here is what I have so far:

library(rvest)

url       <- "http://www.horseracebase.com/index.php/"
pgsession <- html_session(url)               ## create session
pgform    <- html_form(pgsession)[[1]]       ## pull form from session

filled_form <- set_values(pgform,
                          `login`    = "xxx",
                          `password` = "yyy")

submit_form(pgsession, filled_form)

This gets me logged in (I think), but now I don't know how to extract the data.

I do the same thing on Betfair where I use something like:

df <- read.csv("http://www.someurl.com/betfairdata.csv")

This works fine, but all their files are listed on the actual page, so no clicking of buttons is required.

Is there any way to interact with the button using rvest, or is there a way of finding the correct URL so I can just use read.csv as above?

Thanks

Upvotes: 2

Views: 10348

Answers (1)

Hack-R

Reputation: 23231

I created a free account and examined the website.

It looks like you're conflating .csv files with HTML tables. Nowhere on the site, as far as I can find, are there any .csv files. When you say "there's no clicking" and that they "display the .csv files", what you're actually describing is an HTML table.

On a side note, there are also no .csv files in the website's backend. The backend is a relational database, which powers their many filters and search features.

Having said all that, there are plenty of resources on how to scrape XML and HTML tables using rvest and other R packages: in the package documentation, in SO answers, and on various blogs (probably in Stack Overflow Documentation too, though I haven't checked). I will quote one example from Stats and Things, but note that html() has been deprecated in favor of read_html(), though both still work:

library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
population <- url %>%
  html() %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/table[1]') %>%
  html_table()
population <- population[[1]]

head(population)

You can use the built-in Developer Tools in Chrome or Firefox to identify the part of the page that you need to extract. You can also use third-party tools like Firebug, but that's not really necessary.

For example:

  1. In your browser, log in and click Research
  2. Click Statistics
  3. Click Jockey 7 Day Performance
  4. Open Developer Tools
  5. Click the button that says "Select an element to inspect it"
  6. Hover over or click on the HTML table
  7. The highlighted source code corresponds to the table; right-click it
  8. Click "Copy XPath"
  9. Your clipboard now has the target; in this example it's /html/body/table[2]/tbody/tr/td/table[2], which is plugged into the sketch below
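Once you have the XPath, you can plug it into the same kind of pipeline, reusing the session you logged in with. Here is a minimal sketch using the older rvest API from your question (html_session()/submit_form()/jump_to()); the stats-page URL is a placeholder you would copy from your browser's address bar once you're on the Jockey 7 Day Performance page, and the exact session helpers may differ slightly depending on your rvest version:

library(rvest)

## log in exactly as in the question (older rvest API)
url       <- "http://www.horseracebase.com/index.php/"
pgsession <- html_session(url)
pgform    <- html_form(pgsession)[[1]]

filled_form <- set_values(pgform,
                          `login`    = "xxx",
                          `password` = "yyy")

logged_in <- submit_form(pgsession, filled_form)

## jump to the stats page inside the authenticated session
## (placeholder URL -- copy the real one from your browser's address bar)
stats_page <- jump_to(logged_in, "http://www.horseracebase.com/horseracing/jockeystats.php")

## extract the table with the XPath copied from Developer Tools
jockey_7day <- stats_page %>%
  html_nodes(xpath = "/html/body/table[2]/tbody/tr/td/table[2]") %>%
  html_table(fill = TRUE)

jockey_7day <- jockey_7day[[1]]
head(jockey_7day)

If the page exposes the date as a query parameter, you can loop over dates with jump_to() and rbind the results to build up your five years of data.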

Of course, if you want to save your parsed results to a .csv (or any other kind of) file, you can do that once you have them in a data.frame, data.table, or other flat object:

write.csv(population, "population.csv", row.names = FALSE)

Note that some people find it easier to scrape tables with readHTMLTable() from the XML package, though both can do the job.
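If you go that route, here is a minimal sketch against the same Wikipedia page. Note that readHTMLTable() returns every table on the page as a list of data frames, and that it can't fetch https URLs directly (for https pages you'd download the HTML first, e.g. with httr, and pass the text in):

library(XML)

url    <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
tables <- readHTMLTable(url, stringsAsFactors = FALSE)

## the population table should be the first one in the list;
## check str(tables) or names(tables) if it isn't
population <- tables[[1]]
head(population)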

Upvotes: 3
