Scraping public data with migration

Question

I am trying to scrape public data from the UNHCR website.

All the data I want is available through the button marked with a yellow circle. If I left-click this button, I get all the data I need in CSV format.

However, I want to download the data directly in R, so I right-clicked the button, copied the link address, and put it into this code:

    # Dataset page for scraping:
    # https://www.unhcr.org/refugee-statistics/download/?url=E1ZxP4

    download.file(url = "https://api.unhcr.org/population/v1/population/?download=true#_ga=2.69829909.40031775.1622553152-198173155.1622026343",
                  destfile = "C:/Users/User/Documents/Work/Data/DataScraping/Test/test.txt", mode = "wb")

The download completes, but the file does not contain the data I expected. Can anybody help me solve this and load the CSV directly into R?

Paulo Schau Guerra · Accepted Answer

You can access the data in JSON format via an API, as you can see by inspecting the website's XHR requests in the Network tab of the browser's developer tools.

(Right-click this page and inspect it for fun!)

Try this code:

library(jsonlite)

# Base URL
url <- 'https://api.unhcr.org/population/v1/population/?'

# Query items
query_list = list(limit=100,
                  dataset='population',
                  displayType='totals',
                  'columns%5B%5D'='refugees',
                  'columns%5B%5D'='asylum_seekers',
                  'columns%5B%5D'='idps',
                  'columns%5B%5D'='vda',
                  'columns%5B%5D'='stateless',
                  'columns%5B%5D'='ooc',
                  yearFrom=1951,
                  yearTo=2020)

# Concatenates the query items to the base URL 
for (idx in seq_along(query_list)) {
  item_name <- names(query_list[idx])
  item_val <- query_list[[idx]]
  url <- paste0(url, item_name, '=', item_val, '&')
}

# Removes last character, i.e. &
url <- substr(url, 1, nchar(url)-1)

# Encodes URL to avoid errors
url <- URLencode(url)

# Extracts JSON from URL
json_extract <- fromJSON(url)

# Converts relevant list into a data.frame
df <- data.frame(json_extract[['items']])

Notice that the original request query had a limit of 20 results. The only change I made was to increase this limit to 100, which is enough to return all 70 yearly rows (1951 to 2020) in a single response.
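
The loop that concatenates the query items can also be written as a single vectorised step. The following is a minimal sketch, not the answer's original code: `build_query_url` is a hypothetical helper name, and only a subset of the columns is listed to keep it short. It only builds the URL string and makes no network request.

```r
# Hypothetical helper: joins name/value pairs into a query string
# and appends it to the base URL.
build_query_url <- function(base_url, params) {
  # unlist() coerces all values to character, e.g. 100 becomes "100"
  pairs <- paste0(names(params), "=", unlist(params, use.names = FALSE))
  paste0(base_url, paste(pairs, collapse = "&"))
}

# Repeated names are allowed in an R list, which is what the
# 'columns[]' parameter (percent-encoded as columns%5B%5D) relies on.
params <- list(limit = 100,
               dataset = "population",
               "columns%5B%5D" = "refugees",
               "columns%5B%5D" = "idps",
               yearFrom = 1951,
               yearTo = 2020)

url <- build_query_url("https://api.unhcr.org/population/v1/population/?", params)
print(url)
```

The resulting string can be passed to `fromJSON()` in the same way as the URL built by the loop. Using `collapse = "&"` also avoids having to strip a trailing `&` afterwards.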

The resulting data.frame's structure is the following:

> str(df)
'data.frame':   70 obs. of  15 variables:
$ year          : int  1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 ...
$ coo_id        : chr  "-" "-" "-" "-" ...
$ coo_name      : chr  "-" "-" "-" "-" ...
$ coo           : chr  "-" "-" "-" "-" ...
$ coo_iso       : chr  "-" "-" "-" "-" ...
$ coa_id        : chr  "-" "-" "-" "-" ...
$ coa_name      : chr  "-" "-" "-" "-" ...
$ coa           : chr  "-" "-" "-" "-" ...
$ coa_iso       : chr  "-" "-" "-" "-" ...
$ refugees      : int  2116011 1952928 1847304 1749628 1717966 1767975 1742514 1698310 1674185 1656664 ...
$ asylum_seekers: chr  "0" "0" "0" "0" ...
$ idps          : chr  "0" "0" "0" "0" ...
$ vda           : int  NA NA NA NA NA NA NA NA NA NA ...
$ stateless     : chr  "0" "0" "0" "0" ...

And here is its summary:

> summary(df)
      year         coo_id            coo_name             coo           
 Min.   :1951   Length:70          Length:70          Length:70         
 1st Qu.:1968   Class :character   Class :character   Class :character  
 Median :1986   Mode  :character   Mode  :character   Mode  :character  
 Mean   :1986                                                           
 3rd Qu.:2003                                                           
 Max.   :2020                                                           
                                                                        
   coo_iso             coa_id            coa_name             coa           
 Length:70          Length:70          Length:70          Length:70         
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
   coa_iso             refugees        asylum_seekers         idps          
 Length:70          Min.   : 1656664   Length:70          Length:70         
 Class :character   1st Qu.: 2924617   Class :character   Class :character  
 Mode  :character   Median :10098181   Mode  :character   Mode  :character  
                    Mean   : 8862333                                        
                    3rd Qu.:12507743                                        
                    Max.   :20676358                                        
                                                                            
      vda           stateless             ooc           
 Min.   :2592947   Length:70          Length:70         
 1st Qu.:3087436   Class :character   Class :character  
 Median :3581926   Mode  :character   Mode  :character  
 Mean   :3252358                                        
 3rd Qu.:3582064                                        
 Max.   :3582202                                        
 NA's   :67
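
As the `str()` output shows, several count columns (for example asylum_seekers, idps and stateless) arrive as character vectors rather than numbers. Below is a minimal sketch of converting them and saving the result as CSV, since that was the original goal. The small data frame is a stand-in built from the first three rows shown above so that the snippet runs without network access. The list of columns to convert and the temporary file path are assumptions to adapt.

```r
# Stand-in for the data.frame returned by the API (first three rows only)
df <- data.frame(
  year = c(1951L, 1952L, 1953L),
  refugees = c(2116011L, 1952928L, 1847304L),
  asylum_seekers = c("0", "0", "0"),
  idps = c("0", "0", "0"),
  stateless = c("0", "0", "0"),
  stringsAsFactors = FALSE
)

# Columns that came back as character but hold counts (assumed list)
count_cols <- c("asylum_seekers", "idps", "stateless")

# Convert each of them to numeric; non-numeric strings would become NA
df[count_cols] <- lapply(df[count_cols], as.numeric)

# Write the cleaned table to a CSV file (temporary path for illustration)
csv_path <- file.path(tempdir(), "unhcr_population.csv")
write.csv(df, csv_path, row.names = FALSE)

str(df)
```

With the real data, the same two steps can be applied directly to the `df` created from `json_extract[['items']]`, replacing `csv_path` with the desired destination file.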
