Reputation: 1163
I'm trying to extract data (all countries and years) from a table inside an iframe on a website. Some code I've tried is below.
From reading various posts on here I've managed to get the actual address of the page inside the iframe: https://apps7.unep.org/contribution/. I've tried several ways of extracting the data, including various xpaths and html nodes, but nothing seems to be working. I haven't dealt with reactive tables before and would really appreciate some help. If anyone could provide me with the right code, that would be absolutely fantastic.
I would be open to Selenium if absolutely necessary, but I don't know much about it and would much prefer rvest if at all possible.
Thank you in advance.
library(rvest)
library(tidyverse) # also loads dplyr
# This address has the table within an iframe
url <- "https://www.unep.org/about-un-environment/funding-and-partnerships/check-your-contributions"
# This gets the address of the page the data in the iframe actually comes from
base_website <- url %>%
  read_html() %>%
  html_node("iframe") %>%
  html_attr("src")
# I would like to get the data from the table on the above website
my_table <- read_html(base_website) %>%
  # extract(1)
  # html_node("body div") %>%
  html_table()
Upvotes: 0
Views: 299
Reputation: 84465
Explanation:
Each page has dynamic content, including the iframe document. The data comes from API calls, one per available year.
The years are determined by a js file, https://apps7.unep.org/contribution/js/app.72fa461e.js, whose address you can extract from the iframe document you requested. The relevant logic in that file is:
currentYear: function() {
    var e = new Date;
    return e.getFullYear()
},
yearsRange: function() {
    for (var e = this.currentYear, t = [], a = e; a >= 1973; a--) t.push(a);
    return t
}
You can extract the start year from this file, and use the same logic for the current year.
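For example (a sketch, assuming the js source has already been read into a character string js_text):
# Pull the hard-coded start year out of yearsRange and compute the
# current year the same way the JS does
start_year <- str_match(js_text, "yearsRange.*>=(\\d{4});")[, 2] %>% as.integer()
end_year <- as.integer(format(Sys.Date(), "%Y"))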
Next, generate the full set of years and combine with the base API endpoint:
years <- start_year:end_year
requests <- paste0("https://apps7.unep.org/dwh/api/v1/contribution/by-year/", years)
Issue the requests and collect the JSON responses into a list, then apply a custom function to extract the data as a data frame. This involves nested calls that generate data frames, in order to handle the nested list each API response contains under resp$data$amt. The inner data frames get mapped into the final outer data frame.
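In outline, with year_data being the custom function from the full listing below:
data <- map(requests, read_json) # one parsed JSON response per year
df <- map_dfr(data, year_data)   # row-bind the per-year data frames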
You need to handle the fact that there are differing numbers of items across years and geographies (e.g. for 2019). I do this by merging with a blank data frame that has all possible keys present:
map_dfr(., function(x) {
  merge(map(x, ~ ifelse(is.null(.x), NA, .x)) %>%
          data.frame(), df_blank, all = T) %>%
    .[rowSums(is.na(.)) != ncol(.), ]
})
Within that same step, I also handle swapping out the NULLs which start appearing for some region_name entries from 2019. I make sure to add the year to the returned inner data frame as well.
You can re-order columns as you see fit (an example follows the listing).
It is pretty fast, but you could optimize further, e.g. by parallelizing the requests.
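A hypothetical sketch of that optimization using the furrr package (not part of the code below; assumes requests and read_json as in the listing):
library(furrr)
plan(multisession)                      # run requests in background R sessions
data <- future_map(requests, read_json) # drop-in parallel replacement for map()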
R:
library(rvest)
library(jsonlite)
library(tidyverse)
# Convert one parsed JSON response into a data frame for that year
year_data <- function(x) {
  info <- x$resp$data$amt
  year_name <- info %>% names()
  year <- year_name %>% as.integer()
  year_df <- info %>%
    .[[year_name]] %>%
    # merge each record with df_blank so all years/geographies share the same
    # columns, swap NULLs for NA, then drop the all-NA filler rows
    map_dfr(function(x) {
      merge(map(x, ~ ifelse(is.null(.x), NA, .x)) %>%
              data.frame(), df_blank, all = T) %>%
        .[rowSums(is.na(.)) != ncol(.), ]
    }) %>%
    mutate(Year = year)
  return(year_df)
}
# Blank data frame with all possible keys, used to normalize each record
df_blank <- data.frame(
  "country_iso2" = NA_character_,
  "country_iso3" = NA_character_,
  "country_name" = NA_character_,
  "ec" = NA_character_,
  "ef_paid" = NA_integer_,
  "ef_pledged" = NA_integer_,
  "region_name" = NA_character_,
  "unep_region_code" = NA_character_,
  "visc" = NA_character_
)
url <- "https://www.unep.org/about-un-environment/funding-and-partnerships/check-your-contributions"
# Get the address of the page the data in the iframe actually comes from
contribution_iframe_src <- url %>%
  read_html() %>%
  html_element("iframe[src*='/contribution/']") %>%
  html_attr("src")
# This is a dynamic page requiring JS to run, so go to the JS file and
# extract the required info instead
contribution_page <- read_html(contribution_iframe_src)
contribution_years_js_file <- contribution_page %>%
  html_element("[href*='contribution/js']") %>%
  html_attr("href") %>%
  url_absolute(contribution_iframe_src)
js_file <- read_html(contribution_years_js_file)
# Pull the hard-coded start year out of the yearsRange function
start_year <- js_file %>%
  html_text() %>%
  str_match("yearsRange.*>=(\\d{4});") %>%
  .[, 2] %>%
  as.integer()
end_year <- as.integer(format(Sys.Date(), "%Y"))
years <- start_year:end_year
requests <- paste0("https://apps7.unep.org/dwh/api/v1/contribution/by-year/", years)
data <- map(requests, read_json)
df <- map_dfr(data, year_data)
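For example, to re-order columns so Year comes first (column names as defined in df_blank):
df <- df %>% select(Year, country_name, region_name, everything())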
Sample output:
Upvotes: 3
Reputation: 98
I suggest using the package datapasta. Copy and paste the rows and columns into a text editor to clean up the names row, then paste them back in using datapasta's paste_as_tribble from the Addins menu.
library(datapasta)
mydata <- tibble::tribble(....) # filled in by the addin
head(mydata)
# A tibble: 6 x 6
  Country                  Region           Share   Pledge     Paid Unpaid
  <chr>                    <chr>            <dbl>    <dbl>    <dbl>  <dbl>
1 Netherlands              Europe        9090909.  9229600  9229600      0
2 Germany                  Europe        8857042. 8887815. 8887815.     0
3 France                   Europe        7550550  7550550  7550550      0
4 United States of America North America       0  6615000  6615000      0
5 Belgium                  Europe        4687600  6110370  6110370      0
6 Sweden                   Europe        5053036  5053036  5053036      0
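For reference, the tribble call the addin generates would look roughly like this (rows taken from the sample output above, with values rounded as printed):
mydata <- tibble::tribble(
  ~Country,      ~Region,   ~Share, ~Pledge,   ~Paid, ~Unpaid,
  "Netherlands", "Europe", 9090909, 9229600, 9229600,       0,
  "Germany",     "Europe", 8857042, 8887815, 8887815,       0,
  "France",      "Europe", 7550550, 7550550, 7550550,       0
)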
Upvotes: 1