Reputation: 79
I am attempting to get a list of all links to satellite data for a given year/month on the European Space Agency's CryoSat-2 website (https://science-pds.cryosat.esa.int/#Cry0Sat2_data%2FSIR_SAR_L2%2F2013%2F02). No matter which web scraping or HTML reading package I use, the links are never included. Below is one example of such an attempt with the URL provided, but it is by no means my only attempt. I am looking for an explanation of why the links that initiate the download of individual files aren't extracted, and for a way to obtain them.
library(textreadr)
# Directory listing page for SIR_SAR_L2 data, February 2013
html_string <- 'https://science-pds.cryosat.esa.int/#Cry0Sat2_data%2FSIR_SAR_L2%2F2013%2F02'
# Only the static page text comes back; none of the file links appear
html_read <- read_html(html_string)
html_read
[1] "Layer 1" "European Space Agency"
[3] "CryoSat-2 Science Server" "The access and use of CryoSat-2 products are regulated by the"
[5] "ESA's Data Policy" "and subject to the acceptance of the specific"
[7] "Terms & Conditions" "."
[9] "Users accessing CryoSat-2 products are intrinsically acknowledging and accepting the above." "Name"
[11] "Modified" "Size"
[13] "ESA European Space Agency"
Upvotes: 2
Views: 33
Reputation: 5673
Ok, here is a solution. In this kind of case, where you can't get the info with regular scraping (rvest and so on), there are two options:
RSelenium, which can be tedious
Inspecting the network requests in your browser's developer tools to find out where the page actually loads its data from
Going the second route, you will find that the data are loaded from URLs looking like:
https://science-pds.cryosat.esa.int/?do=list&maxfiles=500&pos=5500&file=Cry0Sat2_data/SIR_SAR_L2/2021/02
These pages look like HTML, but if you open them you will see they are not: they return JSON. The webpage just displays the requested info dynamically from that JSON. So you can simply get the info as follows:
# Listing endpoint spotted in the browser's network requests; it returns JSON
url <- 'https://science-pds.cryosat.esa.int/?do=list&maxfiles=500&pos=5500&file=Cry0Sat2_data/SIR_SAR_L2/2021/02'
library(jsonlite)
# Parse the JSON directly; the file table is in $results
fromJSON(url)
$success
[1] TRUE
$is_writable
[1] FALSE
$results
mtime size name
1 1616543429 37713 CS_OFFL_SIR_SAR_2__20210223T041611_20210223T042157_D001.HDR
2 1616543428 845594 CS_OFFL_SIR_SAR_2__20210223T041611_20210223T042157_D001.nc
3 1616543364 37713 CS_OFFL_SIR_SAR_2__20210223T043539_20210223T043844_D001.HDR
4 1616543363 528578 CS_OFFL_SIR_SAR_2__20210223T043539_20210223T043844_D001.nc
5 1616543321 37713 CS_OFFL_SIR_SAR_2__20210223T044915_20210223T045113_D001.HDR
6 1616543322 387650 CS_OFFL_SIR_SAR_2__20210223T044915_20210223T045113_D001.nc
7 1616543360 37713 CS_OFFL_SIR_SAR_2__20210223T045156_20210223T045427_D001.HDR
8 1616543359 456414 CS_OFFL_SIR_SAR_2__20210223T045156_20210223T045427_D001.nc
9 1616543328 37713 CS_OFFL_SIR_SAR_2__20210223T045551_20210223T045749_D001.HDR
10 1616543327 385998 CS_OFFL_SIR_SAR_2__20210223T045551_20210223T045749_D001.nc
path
1 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T041611_20210223T042157_D001.HDR
2 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T041611_20210223T042157_D001.nc
3 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T043539_20210223T043844_D001.HDR
4 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T043539_20210223T043844_D001.nc
5 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T044915_20210223T045113_D001.HDR
6 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T044915_20210223T045113_D001.nc
7 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T045156_20210223T045427_D001.HDR
8 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T045156_20210223T045427_D001.nc
9 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T045551_20210223T045749_D001.HDR
10 Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T045551_20210223T045749_D001.nc
This should give you all the info you need. If you tweak the URL a bit (in particular the file= path), you should be able to get the listing for whichever year and month you want.
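Building on that, here is a rough sketch that pages through one month's listing and assembles candidate download links from the returned paths. Note that the pagination semantics (pos as an offset advanced by maxfiles) and the download URL pattern (server base plus the returned path) are assumptions inferred from the listing URL above, so verify them against the link your browser actually uses when you click a file.
library(jsonlite)

base_url  <- 'https://science-pds.cryosat.esa.int'
month_dir <- 'Cry0Sat2_data/SIR_SAR_L2/2013/02'

# Page through the listing; 'pos' is assumed to be an offset advanced by 'maxfiles'
pages <- list()
pos <- 0
repeat {
  listing <- fromJSON(sprintf('%s/?do=list&maxfiles=500&pos=%d&file=%s', base_url, pos, month_dir))
  if (NROW(listing$results) == 0) break
  pages[[length(pages) + 1]] <- listing$results
  pos <- pos + 500
}
files <- do.call(rbind, pages)

# Assumed download pattern: server base plus the returned path (check it in your browser)
download_urls <- paste0(base_url, '/', files$path)
head(download_urls)
# download.file(download_urls[1], basename(download_urls[1]), mode = 'wb')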
Upvotes: 2
Reputation: 506
The problem seems to be that the page is dynamic: it runs some JS code and only loads the links after that code executes. So when you fetch the HTML from the link, you only get the base page (before the JS runs).
I can think of two possible solutions:
Selenium, which emulates a user in the browser so the page loads completely, but the set-up might be a bit complicated; a rough sketch is shown below.
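Here is a minimal, untested sketch of the RSelenium route. The browser choice, port, wait time, and the bare 'a' selector are placeholders/assumptions; inspect the rendered page to pick the right selector for the file links.
library(RSelenium)

# Start a browser session (assumes a compatible driver, e.g. geckodriver, is installed)
rD <- rsDriver(browser = 'firefox', port = 4545L, verbose = FALSE)
remDr <- rD$client

remDr$navigate('https://science-pds.cryosat.esa.int/#Cry0Sat2_data%2FSIR_SAR_L2%2F2013%2F02')
Sys.sleep(10)  # crude wait so the JS has time to populate the file table

# Collect every anchor's href once the page has rendered; filter for the file links you need
anchors <- remDr$findElements(using = 'css selector', 'a')
hrefs <- unlist(sapply(anchors, function(a) a$getElementAttribute('href')))
head(hrefs)

remDr$close()
rD$server$stop()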
See https://www.r-bloggers.com/2014/12/scraping-with-selenium/ for an intro.
Upvotes: 1