Reputation: 1163
I'm trying to extract data (all countries and years) from a table inside an iframe on a website. Some code I've tried is below.
From reading various posts on here I've managed to get the actual address of the page inside the iframe: https://apps7.unep.org/contribution/. I've tried several ways of extracting the data, including various xpaths and html nodes, but nothing seems to be working. I haven't dealt with reactive tables before and would really appreciate some help. If anyone could provide me with the right code, that would be absolutely fantastic.
I would be open to Selenium if absolutely necessary, but I don't know much about it and would much prefer rvest if at all possible.
Thank you in advance.
library(rvest)
library(tidyverse) # also loads dplyr
# This address has the table within an iframe
url <- "https://www.unep.org/about-un-environment/funding-and-partnerships/check-your-contributions"
# This gets the address of the page the data in the iframe actually comes from
base_website <- url %>%
  read_html() %>%
  html_node("iframe") %>%
  html_attr("src")
# I would like to get the data from the table on the above website
my_table <- read_html(base_website) %>%
  # extract(1)
  # html_node("body div") %>%
  html_table()
Upvotes: 0
Views: 299
Reputation: 84465
Explanation:
Each page has dynamic content, including the iframe document. The data comes from API calls, one per available year.
The years are determined by a js file, https://apps7.unep.org/contribution/js/app.72fa461e.js, whose address you can extract from the iframe document you requested. The relevant logic in that file is:
currentYear: function() {
    var e = new Date;
    return e.getFullYear()
},
yearsRange: function() {
    for (var e = this.currentYear, t = [], a = e; a >= 1973; a--) t.push(a);
    return t
}
You can extract the start year from this file, and use the same logic for the current year.
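For example (a sketch, assuming the js source has already been read into a character string js_text):
# Pull the hard-coded start year out of yearsRange and compute the
# current year the same way the JS does
start_year <- str_match(js_text, "yearsRange.*>=(\\d{4});")[, 2] %>% as.integer()
end_year <- as.integer(format(Sys.Date(), "%Y"))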
Next, generate the full set of years and combine with the base API endpoint:
years <- start_year:end_year
requests <- paste0("https://apps7.unep.org/dwh/api/v1/contribution/by-year/", years)
Issue the requests and collect the JSON responses into a list, then apply a custom function to extract the data as a data frame. This involves nested calls that generate data frames, in order to handle the nested list each API response contains under resp$data$amt. The inner data frames get mapped into the final outer data frame.
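In outline, with year_data being the custom function from the full listing below:
data <- map(requests, read_json) # one parsed JSON response per year
df <- map_dfr(data, year_data)   # row-bind the per-year data frames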
You need to handle the fact that there are differing numbers of items across years and geographies (e.g. for 2019). I do this by merging with a blank data frame that has all possible keys present:
map_dfr(., function(x) {
  merge(map(x, ~ ifelse(is.null(.x), NA, .x)) %>%
          data.frame(), df_blank, all = T) %>%
    .[rowSums(is.na(.)) != ncol(.), ]
})
Within that same step, I also handle swapping out the NULLs which start appearing for some region_name entries from 2019. I make sure to add the year to the returned inner data frame as well.
You can re-order columns as you see fit (an example follows the listing).
It is pretty fast, but you could optimize further, e.g. by parallelizing the requests.
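A hypothetical sketch of that optimization using the furrr package (not part of the code below; assumes requests and read_json as in the listing):
library(furrr)
plan(multisession)                      # run requests in background R sessions
data <- future_map(requests, read_json) # drop-in parallel replacement for map()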
R:
library(rvest)
library(jsonlite)
library(tidyverse)
# Convert one parsed JSON response into a data frame for that year
year_data <- function(x) {
  info <- x$resp$data$amt
  year_name <- info %>% names()
  year <- year_name %>% as.integer()
  year_df <- info %>%
    .[[year_name]] %>%
    # merge each record with df_blank so all years/geographies share the same
    # columns, swap NULLs for NA, then drop the all-NA filler rows
    map_dfr(function(x) {
      merge(map(x, ~ ifelse(is.null(.x), NA, .x)) %>%
              data.frame(), df_blank, all = T) %>%
        .[rowSums(is.na(.)) != ncol(.), ]
    }) %>%
    mutate(Year = year)
  return(year_df)
}
# Blank data frame with all possible keys, used to normalize each record
df_blank <- data.frame(
  "country_iso2" = NA_character_,
  "country_iso3" = NA_character_,
  "country_name" = NA_character_,
  "ec" = NA_character_,
  "ef_paid" = NA_integer_,
  "ef_pledged" = NA_integer_,
  "region_name" = NA_character_,
  "unep_region_code" = NA_character_,
  "visc" = NA_character_
)
url <- "https://www.unep.org/about-un-environment/funding-and-partnerships/check-your-contributions"
# Get the address of the page the data in the iframe actually comes from
contribution_iframe_src <- url %>%
  read_html() %>%
  html_element("iframe[src*='/contribution/']") %>%
  html_attr("src")
# This is a dynamic page requiring JS to run, so go to the JS file and
# extract the required info instead
contribution_page <- read_html(contribution_iframe_src)
contribution_years_js_file <- contribution_page %>%
  html_element("[href*='contribution/js']") %>%
  html_attr("href") %>%
  url_absolute(contribution_iframe_src)
js_file <- read_html(contribution_years_js_file)
# Pull the hard-coded start year out of the yearsRange function
start_year <- js_file %>%
  html_text() %>%
  str_match("yearsRange.*>=(\\d{4});") %>%
  .[, 2] %>%
  as.integer()
end_year <- as.integer(format(Sys.Date(), "%Y"))
years <- start_year:end_year
requests <- paste0("https://apps7.unep.org/dwh/api/v1/contribution/by-year/", years)
data <- map(requests, read_json)
df <- map_dfr(data, year_data)
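For example, to re-order columns so Year comes first (column names as defined in df_blank):
df <- df %>% select(Year, country_name, region_name, everything())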
Sample output:
Upvotes: 3
Reputation: 98
I suggest using the package datapasta. Copy and paste the rows and columns into a text editor to clean up the names row, then paste them back in using datapasta's paste_as_tribble from the Addins menu.
library(datapasta)
mydata <- tibble::tribble(....) # filled in by the addin
head(mydata)
# A tibble: 6 x 6
  Country                  Region           Share   Pledge     Paid Unpaid
  <chr>                    <chr>            <dbl>    <dbl>    <dbl>  <dbl>
1 Netherlands              Europe        9090909.  9229600  9229600      0
2 Germany                  Europe        8857042. 8887815. 8887815.     0
3 France                   Europe        7550550  7550550  7550550      0
4 United States of America North America       0  6615000  6615000      0
5 Belgium                  Europe        4687600  6110370  6110370      0
6 Sweden                   Europe        5053036  5053036  5053036      0
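For reference, the tribble call the addin generates would look roughly like this (rows taken from the sample output above, with values rounded as printed):
mydata <- tibble::tribble(
  ~Country,      ~Region,   ~Share, ~Pledge,   ~Paid, ~Unpaid,
  "Netherlands", "Europe", 9090909, 9229600, 9229600,       0,
  "Germany",     "Europe", 8857042, 8887815, 8887815,       0,
  "France",      "Europe", 7550550, 7550550, 7550550,       0
)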
Upvotes: 1