Wincow

Reputation: 79

links absent when reading html into R

I am attempting to get a list of all links of satellite data for a year/month on the European Space Agency's Cryosat-2 website (https://science-pds.cryosat.esa.int/#Cry0Sat2_data%2FSIR_SAR_L2%2F2013%2F02). No matter what web scraping or html reading package I use, the links are never included. Below is an example of such an attempt with the url provided, but it is by no means my only attempt. I am looking for an explanation as to why the links that initiate the download of individual files aren't extracted, and what the solution is to obtaining them.

library(textreadr)
html_string<- 'https://science-pds.cryosat.esa.int/#Cry0Sat2_data%2FSIR_SAR_L2%2F2013%2F02'
html_read<- read_html(html_string)
html_read

 [1] "Layer 1"                                                                                     "European Space Agency"                                                                      
 [3] "CryoSat-2 Science Server"                                                                    "The access and use of CryoSat-2 products are regulated by the"                              
 [5] "ESA's Data Policy"                                                                           "and subject to the acceptance of the specific"                                              
 [7] "Terms & Conditions"                                                                          "."                                                                                          
 [9] "Users accessing CryoSat-2 products are intrinsically acknowledging and accepting the above." "Name"                                                                                      
[11] "Modified"                                                                                    "Size"                                                                                      
[13] "ESA European Space Agency" 

Upvotes: 2

Views: 33

Answers (2)

denis

Reputation: 5673

OK, here is a solution. In cases like this, where you can't get the information with regular scraping (rvest and so on), there are two options:

  • use RSelenium to drive a real browser, which can be tedious; or
  • inspect the page and monitor the XHR requests (with the element inspector of Firefox, for example: Network tab, then XHR).

You will find that the data are loaded from URLs that look like:

https://science-pds.cryosat.esa.int/?do=list&maxfiles=500&pos=5500&file=Cry0Sat2_data/SIR_SAR_L2/2021/02

These pages look like HTML, but if you open them you'll see they are not: they are JSON. The webpage simply renders the requested JSON dynamically. So you can get the information directly:

# Query the JSON endpoint directly instead of scraping the rendered page
library(jsonlite)
url <- 'https://science-pds.cryosat.esa.int/?do=list&maxfiles=500&pos=5500&file=Cry0Sat2_data/SIR_SAR_L2/2021/02'
fromJSON(url)

$success
[1] TRUE

$is_writable
[1] FALSE

$results
         mtime    size                                                        name
1   1616543429   37713 CS_OFFL_SIR_SAR_2__20210223T041611_20210223T042157_D001.HDR
2   1616543428  845594  CS_OFFL_SIR_SAR_2__20210223T041611_20210223T042157_D001.nc
3   1616543364   37713 CS_OFFL_SIR_SAR_2__20210223T043539_20210223T043844_D001.HDR
4   1616543363  528578  CS_OFFL_SIR_SAR_2__20210223T043539_20210223T043844_D001.nc
5   1616543321   37713 CS_OFFL_SIR_SAR_2__20210223T044915_20210223T045113_D001.HDR
6   1616543322  387650  CS_OFFL_SIR_SAR_2__20210223T044915_20210223T045113_D001.nc
7   1616543360   37713 CS_OFFL_SIR_SAR_2__20210223T045156_20210223T045427_D001.HDR
8   1616543359  456414  CS_OFFL_SIR_SAR_2__20210223T045156_20210223T045427_D001.nc
9   1616543328   37713 CS_OFFL_SIR_SAR_2__20210223T045551_20210223T045749_D001.HDR
10  1616543327  385998  CS_OFFL_SIR_SAR_2__20210223T045551_20210223T045749_D001.nc

                                                                                           path
1   Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T041611_20210223T042157_D001.HDR
2    Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T041611_20210223T042157_D001.nc
3   Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T043539_20210223T043844_D001.HDR
4    Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T043539_20210223T043844_D001.nc
5   Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T044915_20210223T045113_D001.HDR
6    Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T044915_20210223T045113_D001.nc
7   Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T045156_20210223T045427_D001.HDR
8    Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T045156_20210223T045427_D001.nc
9   Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T045551_20210223T045749_D001.HDR
10   Cry0Sat2_data/SIR_SAR_L2/2021/02/CS_OFFL_SIR_SAR_2__20210223T045551_20210223T045749_D001.nc

This should give you all the information you need. By tweaking the url parameters (the file path and the pos offset), you should be able to list the files for any date you want.
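To make the tweaking concrete, here is a small sketch that wraps the endpoint in a function taking a year and month, then builds full download URLs from the returned paths. The query parameters are taken from the example above; the download-URL pattern (server root + relative path) is an assumption, so verify one link in a browser first.

```r
# Sketch: list files for a given year/month via the JSON endpoint.
# ASSUMPTION: the do/maxfiles/pos/file parameters behave as in the
# example above, and download URLs are server root + 'path'.
library(jsonlite)

list_cryosat_files <- function(year, month, maxfiles = 500, pos = 0) {
  url <- sprintf(
    "https://science-pds.cryosat.esa.int/?do=list&maxfiles=%d&pos=%d&file=Cry0Sat2_data/SIR_SAR_L2/%04d/%02d",
    maxfiles, pos, year, month
  )
  fromJSON(url)$results
}

files <- list_cryosat_files(2013, 2)
# Prepend the server root to each relative path (assumed URL pattern)
download_urls <- paste0("https://science-pds.cryosat.esa.int/", files$path)
head(download_urls)
```

To page through months with more than `maxfiles` entries, call the function again with an increased `pos` offset until fewer than `maxfiles` rows come back.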

Upvotes: 2

Leonardo Viotti

Reputation: 506

The problem seems to be that the page is dynamic: it runs some JavaScript that loads the links after the initial page load. So when you fetch the HTML from the link, you only get the base page (before the JS runs).

I can think of two possible solutions:

  • You can try Selenium, which emulates a user in a browser, so the page loads completely; the setup can be a bit involved, though. For an introduction, see https://www.r-bloggers.com/2014/12/scraping-with-selenium/
  • The page probably sends an HTTP request to an API to fetch the links, and you can try to figure out the exact request. The Network tab in your browser is a good place to start.
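Once the request is identified in the Network tab, it can be replayed directly with httr. The sketch below uses the endpoint observed for this site (the same one shown in the other answer); the specific parameter values are assumptions you would read off the captured request.

```r
# Sketch: replay the XHR request found in the browser's Network tab.
# ASSUMPTION: the endpoint and query parameters below match what the
# page actually sends for the 2013/02 listing.
library(httr)

resp <- GET(
  "https://science-pds.cryosat.esa.int/",
  query = list(
    do = "list",
    maxfiles = 500,
    pos = 0,
    file = "Cry0Sat2_data/SIR_SAR_L2/2013/02"
  )
)
stop_for_status(resp)

# Parse the JSON body; dat$results holds the file listing
dat <- content(resp, as = "parsed", type = "application/json")
length(dat$results)
```

Passing the parameters via `query =` keeps the URL readable and lets httr handle the encoding.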

Upvotes: 1
