samq

Reputation: 11

Using rvest to read a table from the web

I'm new to R and web scraping. I'm trying to read a table from the World Bank website into R.

Here is the URL for one of the projects as an example (my goal is to read the left table under "Basic Information"): http://projects.worldbank.org/P156880/?lang=en&tab=details

I'm using Chrome's DevTools to identify the selector nodes that I need for that particular table.

Here is my code:

library(rvest)
url <- "http://projects.worldbank.org/P156880/?lang=en&tab=details"
details <- url %>% 
        read_html() %>% 
        html_nodes(css = '#projectDetails > div:nth-child(2) > div.column-left > table') %>%
        html_table()

Unfortunately, I get an empty list:

> details
list()

Any help on how to resolve this would be greatly appreciated.

Upvotes: 0

Views: 1102

Answers (1)

Chris S.

Reputation: 2225

This site loads the table with XMLHttpRequest (XHR) calls after the page itself has loaded, so the table is not in the HTML that read_html() downloads. You can make the same request yourself using httr. Open Chrome developer tools, go to the Network tab, and then load your URL above. You will notice four other URLs are requested when loading the page. Click on projectdetails? and you should see the HTML table in the Preview tab. Next, right-click on projectdetails?, choose Copy as cURL, and paste the result into a text editor. Then copy the URL, Referer, and X-Requested-With values into the httr GET function below.

library(httr)
library(rvest)

# Request the endpoint that the page itself calls via XHR
res <- GET(
  url = "http://projects.worldbank.org/p2e/projectdetails?projId=P156880&lang=en",
  add_headers(
    Referer = "http://projects.worldbank.org/P156880/?lang=en&tab=details",
    `X-Requested-With` = "XMLHttpRequest"
  )
)
# Parse the response and convert the first table to a data frame
content(res) %>% html_node("table") %>% html_table(header = TRUE)
                Project ID                     P156880
  1                 Status                      Active
  2          Approval Date           December 14, 2017
  3           Closing Date           December 15, 2023
  4                Country                    Colombia
  5                 Region Latin America and Caribbean
  6 Environmental Category                           B
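
To see why the original rvest call returned an empty list, here is a small offline sketch. The HTML string is made up for illustration (it is not the actual World Bank markup). It only mimics a page whose table container is empty until JavaScript fills it in.

```r
library(rvest)

# Hypothetical stand-in for the initial page source: the container exists,
# but the table is only added later by an XHR call, so it is absent here.
static_html <- '<div id="projectDetails"><div></div><div></div></div>'

page <- read_html(static_html)

# The selector matches no <table>, so html_nodes() returns an empty node set
tables <- page %>% html_nodes("#projectDetails table")
length(tables)

# html_table() on an empty node set gives an empty list, as in the question
details <- html_table(tables)
length(details)
```

If both lengths are zero for the real page as well, the selector is probably fine and the data simply arrives through a separate request.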

Or write a function that fetches the table for any project ID:

get_project <- function(id) {
  # Same request as above, with the project ID filled in;
  # the query string is passed as a list instead of being pasted into the path
  res <- GET(
    url = "http://projects.worldbank.org",
    path = "p2e/projectdetails",
    query = list(projId = id, lang = "en"),
    add_headers(
      Referer = paste0("http://projects.worldbank.org/", id, "/?lang=en&tab=details"),
      `X-Requested-With` = "XMLHttpRequest"
    )
  )
  content(res) %>% html_node("table") %>% html_table(header = TRUE)
}
get_project("P156880")
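
If you plan to loop over many project IDs, a slightly more defensive variant may help. This is only a sketch: the names get_project_checked() and parse_project_table() are hypothetical, and it assumes the endpoint keeps returning an HTML fragment with a single table. It adds an HTTP status check and returns NULL instead of failing when no table is found.

```r
library(httr)
library(rvest)

# Parse the first <table> out of an HTML string; return NULL if there is none.
parse_project_table <- function(html_text) {
  node <- read_html(html_text) %>% html_node("table")
  if (inherits(node, "xml_missing")) {
    return(NULL)
  }
  html_table(node, header = TRUE)
}

# Hypothetical wrapper: same request as get_project(), plus a status check.
get_project_checked <- function(id) {
  res <- GET(
    url = "http://projects.worldbank.org",
    path = "p2e/projectdetails",
    query = list(projId = id, lang = "en"),
    add_headers(
      Referer = paste0("http://projects.worldbank.org/", id, "/?lang=en&tab=details"),
      `X-Requested-With` = "XMLHttpRequest"
    )
  )
  stop_for_status(res)  # raise an R error on 4xx/5xx responses
  parse_project_table(content(res, as = "text", encoding = "UTF-8"))
}
```

Calling get_project_checked("P156880") should then behave like get_project("P156880") when the request succeeds. Keeping the parsing step in its own function also lets you try it on a saved HTML string without touching the network.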

Upvotes: 1
