Reputation: 417
I am trying to download some .xlsx files from this kind of webpage (EDIT: or this one). However, when I display the page source (right click --> view source code), I can't see all the content of the actual webpage, just the header and the footer.
I tried to use rvest to list the downloadable links, but the result is the same: it returns only the ones from the header and the footer:
library(rvest)
# read_html() replaces html(), which is deprecated in rvest
read_html("https://b2share.eudat.eu/records/8d47a255ba5749e3ac169527e22f0068") %>%
  html_nodes("a")
Returns:
#{xml_nodeset (5)}
#[1] <a href="https://eudat.eu">Go to EUDAT website</a>
#[2] <a href="https://eudat.eu"><img src="/img/logo_eudat_cdi.svg" alt="EUDAT CDI logo" style="max-width: 200px"></a>
#[3] <a href="https://www.eudat.eu/eudat-cdi-aup">Acceptable Use Policy </a>
#[4] <a href="https://eudat.eu/privacy-policy-summary">Data Privacy Statement</a>
#[5] <a href="https://eudat.eu/what-eudat">About EUDAT</a>
Any idea how to access the content of the whole page?
Upvotes: 1
Views: 98
Reputation: 84465
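The page body appears to be filled in dynamically, so it is not in the static HTML that rvest reads. Instead, you need to pass the record id to an API endpoint, which provides the parts needed to construct the file download links, as follows: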
library(jsonlite)
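# Fetch the record metadata as a nested list
d <- jsonlite::read_json('https://b2share.eudat.eu/api/records/8d47a255ba5749e3ac169527e22f0068')
# Join the files base URL with the key (file name) of the first file entry
files <- paste(d$links$files, d$files[[1]]$key, sep = '/')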
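Note that `d$files[[1]]` picks only the first file entry. If a record lists several files, the same idea can be applied to every entry. The following is a minimal sketch, assuming each element of `d$files` carries a `key` field as in the code above; `build_file_links` and the mock record are hypothetical names and data used only for illustration, not part of the B2SHARE API.

```r
# Build one download URL per file listed in a parsed record.
# `record` is assumed to be the list returned by jsonlite::read_json()
# for a record, with `links$files` and a `files` list as used above.
build_file_links <- function(record) {
  # Collect the key (file name) of every file entry
  keys <- vapply(record$files, function(f) f$key, character(1))
  # paste() would recycle an empty vector to "", so return early instead
  if (length(keys) == 0) {
    return(character(0))
  }
  paste(record$links$files, keys, sep = "/")
}

# Offline mock record (made-up values) with the same shape as the parsed JSON
mock_record <- list(
  links = list(files = "https://example.org/api/files/bucket-id"),
  files = list(list(key = "a.xlsx"), list(key = "b.xlsx"))
)
links <- build_file_links(mock_record)
print(links)
```

With a real record, `build_file_links(d)` should return one link per file rather than only the first.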
For re-use, you can re-write as a function accepting the start link as argument:
library(jsonlite)
library(stringr)
get_links <- function(link){
  # The record id is the last path segment of the link
  record_id <- tail(str_split(link, '/')[[1]], 1)
  d <- jsonlite::read_json(paste0('https://b2share.eudat.eu/api/records/', record_id))
  links <- paste(d$links$files, d$files[[1]]$key, sep = '/')
  return(links)
}
get_links('https://b2share.eudat.eu/records/ce32a67a789b44a1a15965fd28a8cb17')
get_links('https://b2share.eudat.eu/records/8d47a255ba5749e3ac169527e22f0068')
If you already have the record id, you could simplify the function to:
library(jsonlite)
get_links <- function(record_id){
  d <- jsonlite::read_json(paste0('https://b2share.eudat.eu/api/records/', record_id))
  links <- paste(d$links$files, d$files[[1]]$key, sep = '/')
  return(links)
}
get_links('ce32a67a789b44a1a15965fd28a8cb17')
get_links('8d47a255ba5749e3ac169527e22f0068')
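Since the goal in the question is to download the .xlsx files, the links returned by `get_links()` can then be passed to base R's `download.file()`. This is only a sketch: `download_all` and the `downloads` folder are hypothetical names, and it assumes each URL ends in the file name.

```r
# Download every URL in `urls` into `dest_dir`, naming each file after
# the last path segment of its URL. Returns the local paths invisibly.
download_all <- function(urls, dest_dir = "downloads") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(dest_dir, basename(urls))
  for (i in seq_along(urls)) {
    # mode = "wb" writes the file in binary mode, which .xlsx needs on Windows
    download.file(urls[i], destfile = dest[i], mode = "wb")
  }
  invisible(dest)
}

# Usage (requires network access and the get_links() function above):
# download_all(get_links('8d47a255ba5749e3ac169527e22f0068'))
```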
Upvotes: 2