Reputation: 2508
I am trying to scrape public data from the UNHCR website.
All the data I want is available through the button marked with a yellow circle. If I left-click this button, I get all the data I need in CSV format.
However, I want to download the data directly in R, so I right-clicked the button, copied the link address, and put it into this line of code:
# Dataset for scraping:
# https://www.unhcr.org/refugee-statistics/download/?url=E1ZxP4
download.file(url = "https://api.unhcr.org/population/v1/population/?download=true#_ga=2.69829909.40031775.1622553152-198173155.1622026343",
              destfile = "C:/Users/User/Documents/Work/Data/DataScraping/Test/test.txt", mode = "wb")
In the end a file is downloaded, but it does not contain the data I expected. Can anybody help me solve this and download the CSV directly into R?
Upvotes: 0
Views: 101
Reputation: 631
You can access the data in JSON format via an API, as you can see by inspecting the website's XHR requests in the Network tab of the browser's developer tools.
(Right-click this page and inspect it for fun!)
Try this code:
library(jsonlite)
# Base URL
url <- 'https://api.unhcr.org/population/v1/population/?'
# Query items
query_list <- list(limit = 100,
                   dataset = 'population',
                   displayType = 'totals',
                   'columns%5B%5D' = 'refugees',
                   'columns%5B%5D' = 'asylum_seekers',
                   'columns%5B%5D' = 'idps',
                   'columns%5B%5D' = 'vda',
                   'columns%5B%5D' = 'stateless',
                   'columns%5B%5D' = 'ooc',
                   yearFrom = 1951,
                   yearTo = 2020)
# Concatenates the query items to the base URL
for (idx in seq_along(query_list)) {
  item_name <- names(query_list[idx])
  item_val <- query_list[[idx]]
  url <- paste0(url, item_name, '=', item_val, '&')
}
# Removes last character, i.e. &
url <- substr(url, 1, nchar(url)-1)
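# Encodes any unsafe characters in the URL to avoid errors
# (the column names above are already percent-encoded as %5B%5D, so URLencode leaves them untouched)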
url <- URLencode(url)
# Extracts JSON from URL
json_extract <- fromJSON(url)
# Converts relevant list into a data.frame
df <- data.frame(json_extract[['items']])
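As a variation on the concatenation loop above, the query string can also be assembled in one vectorised step. This is only a sketch: build_query is a hypothetical helper name, not part of the original code or of any package, and it assumes the parameter names are written unencoded (columns[] rather than columns%5B%5D) so that the helper does the percent-encoding itself.

```r
# Hypothetical helper (an illustration, not from the original answer):
# turns a named list into "name=value&name=value", percent-encoding
# every name and value, including reserved characters such as [ and ].
build_query <- function(params) {
  keys <- vapply(names(params), URLencode, character(1), reserved = TRUE)
  vals <- vapply(params,
                 function(v) URLencode(as.character(v), reserved = TRUE),
                 character(1))
  paste(paste0(keys, "=", vals), collapse = "&")
}

# A shortened version of the query list, with repeated "columns[]" entries
query <- build_query(list(limit = 100,
                          "columns[]" = "refugees",
                          "columns[]" = "idps",
                          yearFrom = 1951))
print(query)
```

The result can be appended to the base URL with paste0(url, query), which makes both the trailing-& cleanup and the final URLencode call unnecessary.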
Notice that the original request query had a limit of 20 results. The only change I made was to raise this limit to 100, which returns all 70 yearly rows (1951 to 2020) in a single response.
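If a query ever returns more rows than the limit allows, the pages could be collected in a loop instead. The sketch below is written against a caller-supplied fetch_page function so it can run offline. The names fetch_all_pages, fetch_page and fake_fetch are hypothetical, and the maxPages field is an assumption about the response shape that should be checked against the actual JSON; only the items field appears in the code above.

```r
# Sketch: collect every page of a paginated endpoint into one data.frame.
# `fetch_page` is a function(page) returning a list with:
#   items    - a data.frame holding that page's rows
#   maxPages - the total number of pages (ASSUMED field name)
fetch_all_pages <- function(fetch_page) {
  first <- fetch_page(1)
  pages <- list(first$items)
  if (first$maxPages > 1) {
    for (p in 2:first$maxPages) {
      pages[[p]] <- fetch_page(p)$items
    }
  }
  do.call(rbind, pages)
}

# Offline stand-in for the API: 45 made-up rows served 20 at a time
fake_data <- data.frame(year = 1951:1995, refugees = seq_len(45))
fake_fetch <- function(page, limit = 20) {
  rows <- seq_len(nrow(fake_data))
  idx <- rows[ceiling(rows / limit) == page]
  list(items = fake_data[idx, , drop = FALSE],
       maxPages = ceiling(nrow(fake_data) / limit))
}

all_rows <- fetch_all_pages(fake_fetch)
print(nrow(all_rows))
```

Against the real API, fetch_page might wrap fromJSON with a page number appended to the URL, assuming the endpoint accepts a page parameter. For this dataset the single request with limit=100 already suffices, so the loop only matters for larger queries.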
The resulting data.frame's structure is the following:
> str(df)
'data.frame': 70 obs. of 15 variables:
 $ year          : int  1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 ...
 $ coo_id        : chr  "-" "-" "-" "-" ...
 $ coo_name      : chr  "-" "-" "-" "-" ...
 $ coo           : chr  "-" "-" "-" "-" ...
 $ coo_iso       : chr  "-" "-" "-" "-" ...
 $ coa_id        : chr  "-" "-" "-" "-" ...
 $ coa_name      : chr  "-" "-" "-" "-" ...
 $ coa           : chr  "-" "-" "-" "-" ...
 $ coa_iso       : chr  "-" "-" "-" "-" ...
 $ refugees      : int  2116011 1952928 1847304 1749628 1717966 1767975 1742514 1698310 1674185 1656664 ...
 $ asylum_seekers: chr  "0" "0" "0" "0" ...
 $ idps          : chr  "0" "0" "0" "0" ...
 $ vda           : int  NA NA NA NA NA NA NA NA NA NA ...
 $ stateless     : chr  "0" "0" "0" "0" ...
And here is its summary:
> summary(df)
      year         coo_id            coo_name             coo
 Min.   :1951   Length:70          Length:70          Length:70
 1st Qu.:1968   Class :character   Class :character   Class :character
 Median :1986   Mode  :character   Mode  :character   Mode  :character
 Mean   :1986
 3rd Qu.:2003
 Max.   :2020
   coo_iso             coa_id            coa_name             coa
 Length:70          Length:70          Length:70          Length:70
 Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character   Mode  :character
   coa_iso             refugees        asylum_seekers         idps
 Length:70          Min.   : 1656664   Length:70          Length:70
 Class :character   1st Qu.: 2924617   Class :character   Class :character
 Mode  :character   Median :10098181   Mode  :character   Mode  :character
                    Mean   : 8862333
                    3rd Qu.:12507743
                    Max.   :20676358
      vda            stateless             ooc
 Min.   :2592947   Length:70          Length:70
 1st Qu.:3087436   Class :character   Class :character
 Median :3581926   Mode  :character   Mode  :character
 Mean   :3252358
 3rd Qu.:3582064
 Max.   :3582202
 NA's   :67
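The str() and summary() output above shows that several count columns (asylum_seekers, idps, stateless, ooc) come back as character rather than numeric, so they may need converting before any arithmetic. Here is a minimal sketch on a toy data.frame: df_demo and its values are a stand-in that reuses only the first few rows shown above, and the ooc values are made up for illustration.

```r
# Toy stand-in for df (ooc values are illustrative, not real data)
df_demo <- data.frame(
  year           = 1951:1953,
  refugees       = c(2116011L, 1952928L, 1847304L),
  asylum_seekers = c("0", "0", "0"),
  idps           = c("0", "0", "0"),
  stateless      = c("0", "0", "0"),
  ooc            = c("0", "0", "0"),
  stringsAsFactors = FALSE
)

# Columns that arrive as character but hold counts
count_cols <- c("asylum_seekers", "idps", "stateless", "ooc")

# Convert each of them to numeric in place
df_demo[count_cols] <- lapply(df_demo[count_cols], as.numeric)

str(df_demo)
```

Any entry that is not a number (for example a "-" placeholder) would become NA with a warning, so it is worth checking for new NAs after the conversion. The converted data could then be saved with write.csv(df_demo, "some_file.csv", row.names = FALSE) if a CSV file is the end goal.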
Upvotes: 1