FrsLry

Reputation: 417

Web scraping with R: can't see the downloadable links

I am trying to download some .xlsx files from this kind of webpage (EDIT: or this one). However, when I view the page source (right click --> view source code), I can't see all the content of the actual webpage, only the header and the footer.

I tried using rvest to list the downloadable links, but the result is the same: it returns only the links from the header and the footer:

library(rvest)
# read_html() replaces the deprecated html()
read_html("https://b2share.eudat.eu/records/8d47a255ba5749e3ac169527e22f0068") %>% 
     html_nodes("a")

Returns:

#{xml_nodeset (5)}
#[1] <a href="https://eudat.eu">Go to EUDAT website</a>
#[2] <a href="https://eudat.eu"><img src="/img/logo_eudat_cdi.svg" alt="EUDAT CDI logo" style="max-width: 200px"></a>
#[3] <a href="https://www.eudat.eu/eudat-cdi-aup">Acceptable Use Policy </a>
#[4] <a href="https://eudat.eu/privacy-policy-summary">Data Privacy Statement</a>
#[5] <a href="https://eudat.eu/what-eudat">About EUDAT</a>

Any idea how to access the content of the whole page?

Upvotes: 1

Views: 98

Answers (1)

QHarr

Reputation: 84465

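The file list is not in the page's static HTML, which is why rvest only finds the header and footer links. Instead, pass the record id to the API endpoint, which returns the parts needed to construct the file download links: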

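library(jsonlite)

# Fetch the record's metadata as a nested list
d <- jsonlite::read_json('https://b2share.eudat.eu/api/records/8d47a255ba5749e3ac169527e22f0068')

# Join the files base URL with the key of the first file entry
files <- paste(d$links$files, d$files[[1]]$key, sep = '/')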
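Note that `d$files[[1]]` selects only the first file entry. If a record holds several files, something along these lines might collect a link for each of them. This is a minimal sketch, not part of the original answer: `build_file_links` and `mock_record` are hypothetical names, and it assumes (as the snippet above implies) that `files` is a list of entries each carrying a `key`, and that `links$files` is the base URL for the record's files.

```r
# Hypothetical helper (not part of the original answer): one link per file.
# Assumes record$files is a list of entries that each carry a `key` field,
# and record$links$files is the base URL for the record's files.
build_file_links <- function(record) {
  keys <- vapply(record$files, function(f) f$key, character(1))
  paste(record$links$files, keys, sep = "/")
}

# Made-up record with the assumed shape, so the helper can be tried offline
mock_record <- list(
  links = list(files = "https://example.org/api/files/some-bucket"),
  files = list(list(key = "first.xlsx"), list(key = "second.xlsx"))
)
build_file_links(mock_record)
```

With the real record fetched above, the call would be `build_file_links(d)`.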

For reuse, you can rewrite this as a function that accepts the record's page URL as its argument:

library(jsonlite)
library(stringr)

get_links <- function(link){
  record_id <- tail(str_split(link, '/')[[1]], 1)
  d <- jsonlite::read_json(paste0('https://b2share.eudat.eu/api/records/', record_id))
  links <- paste(d$links$files, d$files[[1]]$key , sep = '/')
  return(links)
}

get_links('https://b2share.eudat.eu/records/ce32a67a789b44a1a15965fd28a8cb17')
get_links('https://b2share.eudat.eu/records/8d47a255ba5749e3ac169527e22f0068')

If you already have the record id, you could simplify this to:

library(jsonlite)

get_links <- function(record_id){
  d <- jsonlite::read_json(paste0('https://b2share.eudat.eu/api/records/', record_id))
  links <- paste(d$links$files, d$files[[1]]$key , sep = '/')
  return(links)
}

get_links('ce32a67a789b44a1a15965fd28a8cb17')
get_links('8d47a255ba5749e3ac169527e22f0068')
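Since the goal was to download the .xlsx files, the links returned by `get_links` could then be passed to base R's `download.file`. The following is only a sketch under assumptions: `download_links` and the `dest_dir` argument are hypothetical names, and it assumes the last segment of each link is a usable file name.

```r
# Hypothetical helper: download each link into dest_dir,
# naming each file after the last segment of its URL.
download_links <- function(links, dest_dir = ".") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(dest_dir, basename(links))
  for (i in seq_along(links)) {
    # mode = "wb" keeps binary files such as .xlsx intact on Windows
    download.file(links[i], destfile = dest[i], mode = "wb", quiet = TRUE)
  }
  invisible(dest)  # return the local paths without printing them
}

# Usage (requires network access):
# links <- get_links('8d47a255ba5749e3ac169527e22f0068')
# download_links(links, dest_dir = "b2share_files")
```

The loop is used rather than a single vectorised call because `download.file` only accepts several URLs at once with some download methods.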

Upvotes: 2
