Web scraping, unable to target a table

Question

I am having a problem trying to target an embedded table to collect a second set of links. It is an embarrassingly simple task.

The website is https://irma.nps.gov/DataStore/Reference/Profile/2233469. While they have an API for other products, these data are not included. (I have a key, which is appended to the end of the URL, if needed, but it doesn't seem to affect this type of page.)

What I am trying to do is copy all of the links contained in the table, or download the contents of the entire table and convert it to a data frame in R (I have achieved this with simpler tables). I should not have difficulties with that part! Although I also believed I would be able to crack this table pretty easily...

I have followed a few different guides and questions to try to approach this problem, but I keep hitting a wall. I have been hoping to accomplish this with the rvest/xml2/httr/jsonlite suite of packages, and am still not convinced I need RSelenium for this.

library(rvest)
page <- read_html('https://irma.nps.gov/DataStore/Reference/Profile/2233469')

Approach 1.

app1 <- html_nodes(page, "body")
app1 <- app1[[1]]
app1 <- app1 %>% html_attr('href')

I have tried a few ways to get data out of here, but the common ones, such as html_table, seem to fail.

Approach 2.

app2 <- page %>%
  html_nodes(xpath = '//*[(@id = "digitalResourcesGrid-body")]') %>%
  html_attr('href')

Approach 3.

app3 <- page %>% html_nodes('div.x-grid-view x-fit-item x-grid-view-default') 
app3 <- app3 %>% html_nodes("a")  
app3 <- xml_text(app3)

I've tried a dozen permutations of different functions without success. Each approach fails somewhere along the way, but it seems that any one of them should work in theory.
Any help with targeting this table in the first place would be greatly appreciated. I do think I can get what I want out of it, if only I can access it.

Dave2e · Accepted Answer

It looks like the target table is stored as a JSON file. It is easier to use the developer tools in your web browser to find its address and then download the file directly.
In the developer tools, go to the Network tab, filter for XHR requests, and reload the webpage. A couple of files should be listed; look at each one to find the file containing the desired information. Right-click the file to copy its URL.

library(jsonlite)
webpagetable <- fromJSON("https://irma.nps.gov/DataStore/Reference/GetHoldings?_dc=1609810155944&referenceId=2233469&page=1&start=0&limit=25&sort=%5B%7B%22property%22%3A%22DisplayOrder%22%2C%22direction%22%3A%22ASC%22%7D%2C%7B%22property%22%3A%22HoldingType%22%2C%22direction%22%3A%22ASC%22%7D%2C%7B%22property%22%3A%22Description%22%2C%22direction%22%3A%22ASC%22%7D%5D")

head(webpagetable)
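
The long URL above is easier to reuse if it is assembled from its parts. The following is a minimal sketch, not a confirmed description of the service: build_holdings_url and find_link_columns are hypothetical helper names, the parameter names are copied from the URL above, and it assumes that the server does not need the `_dc` timestamp and that the response parses to a data frame with at least one character column holding links.

```r
# Hypothetical helper: rebuild the GetHoldings URL from readable parts.
# Parameter names come from the URL copied out of the browser; `_dc` (which
# looks like a cache-busting timestamp) is omitted on the assumption that the
# server does not require it.
build_holdings_url <- function(reference_id, limit = 25, start = 0, page = 1) {
  sort_spec <- paste0(
    '[{"property":"DisplayOrder","direction":"ASC"},',
    '{"property":"HoldingType","direction":"ASC"},',
    '{"property":"Description","direction":"ASC"}]'
  )
  sprintf(
    "%s?referenceId=%d&page=%d&start=%d&limit=%d&sort=%s",
    "https://irma.nps.gov/DataStore/Reference/GetHoldings",
    as.integer(reference_id), as.integer(page),
    as.integer(start), as.integer(limit),
    utils::URLencode(sort_spec, reserved = TRUE)  # percent-encode the JSON
  )
}

# Hypothetical helper: keep only the character columns that contain at least
# one value starting with http:// or https://.
find_link_columns <- function(df) {
  is_link <- vapply(
    df,
    function(col) is.character(col) && any(grepl("^https?://", col)),
    logical(1)
  )
  df[is_link]
}

# Usage (needs network access and the jsonlite package):
# holdings <- jsonlite::fromJSON(build_holdings_url(2233469, limit = 100))
# find_link_columns(holdings)
```

Raising `limit` is one way to try to fetch more than the first 25 rows in a single request, though whether the server honours larger values is untested here.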
