Noffy

Reputation: 11

How to use R to download a file from a webpage when there is no specific file embedded on the page

Is there any way to download a file from a website with download.file() in R when there is no direct link to the file on the page?

I have this url

https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2016&month=0&season1=2016&ind=0

There is a link on that page to export a CSV file to my working directory, but when I right-click the "Export Data" hyperlink and copy the link address, it turns out to be the following script

javascript:__doPostBack('LeaderBoard1$cmdCSV','') 

instead of a URL that gives me access to the CSV file.

Is there any way to tackle this problem?

Upvotes: 1

Views: 3384

Answers (1)

Benjamin

Reputation: 896

You can use RSelenium for jobs like this. The script below works for me exactly as is, and it should for you as well with the minor edits noted in the text. The solution uses two packages: RSelenium to automate Chrome, and here to build paths relative to your project directory.

library(RSelenium)
library(here)

Here's the URL you provided:

url <- paste0(
  "https://www.fangraphs.com/leaders.aspx",
  "?pos=all",
  "&stats=bat",
  "&lg=all",
  "&qual=y",
  "&type=8",
  "&season=2016",
  "&month=0",
  "&season1=2016",
  "&ind=0"
)

Here's the ID of the download button. You can find it by right-clicking the button in Chrome and hitting "Inspect."

button_id <- "LeaderBoard1_cmdCSV"

We're going to automate Chrome to download the file, and it's going to go to your default download location. At the end of the script we'll want to move it to your current directory. So first let's set the name of the file (per fangraphs.com) and your download location (which you should edit as needed):

filename <- "FanGraphs Leaderboard.csv"
download_location <- file.path(Sys.getenv("USERPROFILE"), "Downloads")

Now you'll want to start a browser session. I use Chrome, and specifying this particular Chrome version (via the chromever argument) works for me; the value needs to correspond to a ChromeDriver build that matches your installed Chrome. YMMV; check the best way to start a browser session for you.

An rsDriver object has two parts: a server and a browser client. Most of the magic happens in the browser client.

driver <- rsDriver(
  browser = "chrome",
  chromever = "74.0.3729.6"
)
server <- driver$server
browser <- driver$client

Using the browser client, navigate to the page and click that button.

Quick note before you do: RSelenium may start looking for the button and trying to click it before there's anything to click. So I added a few lines to watch for the button to show up, and then click it once it's there.

buttons <- list()
browser$navigate(url)
while (length(buttons) == 0) {
  buttons <- browser$findElements(button_id, using = "id")
  Sys.sleep(0.1)  # poll politely instead of busy-waiting
}
buttons[[1]]$clickElement()

Then wait for the file to show up in your downloads folder, and move it to the current project directory:

while (!file.exists(file.path(download_location, filename))) {
  Sys.sleep(0.1)
}
file.rename(file.path(download_location, filename), here(filename))

Lastly, always clean up your server and browser client, or RSelenium gets quirky with you.

browser$close()
server$stop()

And you're on your merry way!


Note that you won't always have an element ID to use, and that's OK. IDs are great because they uniquely identify an element, and using them requires almost no knowledge of HTML or CSS. But if you don't have an ID to use, above where I specify using = "id", you have a lot of other options:

  • using = "xpath"
  • using = "css selector"
  • using = "name"
  • using = "tag name"
  • using = "class name"
  • using = "link text"
  • using = "partial link text"

Those give you a ton of alternatives and really allow you to identify anything on the page. findElements will always return a list. If there's nothing to find, that list will be of length zero. If it finds multiple elements, you'll get all of them.
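As a quick sketch of that pattern (assuming you still have the `browser` client from the rsDriver() session above; the CSS selector here is purely hypothetical, so substitute one from the page you're scraping):

```r
# Look up elements by CSS selector; findElements always returns a list.
elems <- browser$findElements("table.my-results-table", using = "css selector")

# Length zero means nothing matched -- handle it instead of indexing blindly.
if (length(elems) == 0) {
  stop("No matching elements found on the page")
}

# With multiple matches, pick the one you want by index.
elems[[1]]$highlightElement()  # briefly flashes the element in the browser
```

Checking `length()` before `[[1]]` is the key habit: indexing an empty list is the most common way these scripts fall over when a page changes.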

XPath and CSS selectors in particular are super versatile. And you can find them without really knowing what you're doing. Let's walk through an example with the "Sign In" button on that page, which in fact does not have an ID.

Start in Chrome by pressing Control+Shift+J to open the Developer Console. In the upper-left corner of the panel that appears is a little icon for selecting elements. Click it, and then click on the element you want.

That'll pull it up (highlight it) over in the "Elements" panel. Right-click the highlighted line and click "Copy selector." You can also click "Copy XPath," if you want to use XPath.

And that gives you your code!

buttons <- browser$findElements(
  "#linkAccount > div > div.label-account",
  using = "css selector"
)
buttons[[1]]$clickElement()

Boom.

Upvotes: 5
