Il Forna

Reputation: 23

R web scraping charts on multiple pages

Maybe this subject is covered in other posts, but I cannot find a solution to my issue. I am trying to scrape indicator data from the https://tradingeconomics.com/indicators website: specifically, the country names and the plots shown on each country's page.

tradec = function(tradelink) {
  trade_page = read_html(tradelink)
  trade_element = trade_page %>% html_nodes(".primary_photo+ td a") %>%
    html_text() %>% paste(collapse = ",")
  return(trade_element)
}

main_page <- read_html("https://tradingeconomics.com/country-list/gdp-growth-rate")
country_list <- main_page %>% 
  html_nodes("#ctl00_ContentPlaceHolder1_ctl01_UpdatePanel1 a") %>% 
  html_text() %>% 
  trimws() %>% 
  gsub(" ", "-", .)


tradec_df = data.frame()

for (i in country_list) {
  link = paste0("https://tradingeconomics.com/", i, "/gdp-growth")
  page = read_html(link)

  country = page %>% html_nodes("#SelectCountries") %>% html_text()
  tradec_charts = page %>% html_nodes("#ImageChart") %>% html_text()

  tradec_df = rbind(tradec_df, data.frame(country, tradec_charts, stringsAsFactors = FALSE))
  print(paste("Page:", i))  # report the current country, not the whole vector
}

Ideally, I would like one page printed per country, showing the country name and its plot. I am fairly sure the plots can be scraped and displayed somehow, but I have no idea how. Any suggestions?

Upvotes: 0

Views: 337

Answers (1)

stevec

Reputation: 52468

It's not working because each element of country_list is padded with whitespace (carriage returns, newlines and spaces), which is not valid in a URL:

 [1] "\r\n                                        South Africa\r\n                                    "          
 [2] "\r\n                                        Peru\r\n                                    "                  
 [3] "\r\n                                        Botswana\r\n                                    "   

So all you need to do is strip the leading and trailing whitespace with trimws(), so the names look like this instead:

country_list
 [1] "South Africa"           "Peru"                   "Botswana"               "India"                  "Turkey"                
 [6] "New Zealand"            "Argentina"              "Malta"                  "Slovenia"               "El Salvador"           
[11] "Ireland"                "Rwanda"                 "Albania"                "Luxembourg"             "Nigeria"               
[16] "Canada"                 "Jamaica"                "Uruguay"                "Brazil"                 "Paraguay"  
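
To see the effect in isolation, here is a minimal sketch in base R using three of the raw strings shown above. The names raw_names, clean_names, slugs and links are made up for illustration, and the hyphenation step simply mirrors the gsub() call from the question; I have not checked which URL form the site actually expects.

```r
raw_names <- c(
  "\r\n                                        South Africa\r\n                                    ",
  "\r\n                                        Peru\r\n                                    ",
  "\r\n                                        Botswana\r\n                                    "
)

# trimws() strips leading and trailing whitespace, including \r and \n
clean_names <- trimws(raw_names)
print(clean_names)

# Spaces inside a name survive trimws(), so replace them before building a URL
slugs <- gsub(" ", "-", clean_names, fixed = TRUE)
links <- paste0("https://tradingeconomics.com/", slugs, "/gdp-growth")
print(links)
```

Note that trimws() only touches the ends of each string, which is why "South Africa" still needs the separate gsub() step.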

This works. The only change is piping the result of html_text() into trimws():

library(tidyverse)
library(rvest)

# Helper carried over from the question; it is not called below
tradec = function(tradelink) {
  trade_page = read_html(tradelink)
  trade_element = trade_page %>% html_nodes(".primary_photo+ td a") %>%
    html_text() %>% paste(collapse = ",")
  return(trade_element)
}

main_page <- read_html("https://tradingeconomics.com/country-list/gdp-growth-rate")
country_list <- main_page %>% 
  html_nodes("#ctl00_ContentPlaceHolder1_ctl01_UpdatePanel1 a") %>% 
  html_text() %>% 
  trimws()  # strip the surrounding "\r\n" and spaces

tradec_df = data.frame()

for (i in country_list) {
  link = paste0("https://tradingeconomics.com/", i, "/gdp-growth")
  page = read_html(link)

  country = page %>% html_nodes("#SelectCountries") %>% html_text()
  tradec_links = page %>% html_nodes("#ImageChart") %>% html_text()
}
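
On the plots: html_text() returns an empty string for an image, because an <img> tag has no text content; the picture's location lives in its src attribute. The sketch below runs against a small inline HTML snippet rather than the live site, so the markup, the example.com URL and the assumption that #ImageChart is an <img> element are all hypothetical. Check the real page source before relying on it.

```r
library(rvest)

# Hypothetical stand-in for one country page; the real markup may differ
page <- read_html('<html><body>
  <img id="ImageChart" src="https://example.com/charts/peru-gdp-growth.png">
</body></html>')

chart <- html_node(page, "#ImageChart")

chart_text <- html_text(chart)         # empty string: an image has no text
chart_src  <- html_attr(chart, "src")  # the URL of the image file

# Store the name and the chart URL together, one row per country
row <- data.frame(country = "Peru", chart_url = chart_src, stringsAsFactors = FALSE)
print(row)

# With a real URL, the image could then be saved using base R:
# download.file(chart_src, destfile = "peru.png", mode = "wb")
```

If that assumption holds for the real pages, swapping html_text for html_attr("src") inside the loop would give you one chart URL per country, which you could then download or embed in a report.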

Upvotes: 1
