Ben Mat
Ben Mat

Reputation: 41

How to simply download an Excel File directly from the Internet with R

I have the URL of a file I want to download and from here directly read into R. It is an .xlsx file from Gapminder. The Webiste to find it is "https://github.com/Gapminder-Indicators/u5mr/blob/master/u5mr-by-gapminder.xlsx".

I have tried two things:

url <- "https://www.gapminder.org/data/documentation/gd005/u5mr-by-gapminder.xlsx"
download.file(url,destfile="example.xlsx")
example <- read_excel("example.xlsx")

with the error

> url <- "https://github.com/Gapminder-Indicators/u5mr/blob/master/u5mr-by-gapminder.xlsx"
> download.file(url,destfile="example.xlsx")
trying URL 'https://github.com/Gapminder-Indicators/u5mr/blob/master/u5mr-by-gapminder.xlsx'
Content type 'text/html; charset=utf-8' length unknown
downloaded 61 KB

> example <- read_excel("example.xlsx")
Error: Evaluation error: zip file 'C:\Users\user\Documents\example.xlsx' cannot be opened.
> 

and

library(RCurl)
URL <- "https://www.gapminder.org/data/documentation/gd005/u5mr-by-gapminder.xlsx"
x <- getURL(URL)
out <- read.csv(textConnection(x))
head(out[1:6])

with the result

> library(RCurl)
> URL <- "https://www.gapminder.org/data/documentation/gd005/u5mr-by-gapminder.xlsx"
> x <- getURL(URL)
> out <- read.csv(textConnection(x))
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  EOF within quoted string
> head(out[1:6])
Error in `[.data.frame`(out, 1:6) : undefined columns selected

So the file is either not downloaded at all or incorrectly read in. How can I simply download that one excel-file from the webpage and read in?

Upvotes: 0

Views: 749

Answers (2)

Allan Cameron
Allan Cameron

Reputation: 173793

I don't think just changing the url as @James says is enough, (note that the url in question forwards your browser to a different url with the actual file - in your case it's https://raw.githubusercontent.com/Gapminder-Indicators/u5mr/master/u5mr-by-gapminder.xlsx).

But this is not the only problem. You need to write the downloaded file in binary format using mode = "wb" before you can open it with read_excel, also specifying the worksheet that you want to open.

Here's a working example:

library(readxl)

destfile <- path.expand("~/example.xlsx")
url      <- paste0("https://raw.githubusercontent.com/Gapminder-Indicators",
                    "/u5mr/master/u5mr-by-gapminder.xlsx")

download.file(url, destfile = destfile, mode = "wb")
example <- read_excel(destfile, sheet = 2)

and now you have the spreadsheet as a tibble:

print(example)
#> # A tibble: 275 x 305
#>    geo.name indicator.name geo   indicator `1800` `1801` `1802` `1803` `1804` `1805`
#>    <chr>    <chr>          <chr> <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#>  1 Abkhazia Child mortali~ abkh  u5mr         NA     NA     NA     NA     NA     NA 
#>  2 Afghani~ Child mortali~ afg   u5mr        469.   469.   469.   469.   469.   469.
#>  3 Akrotir~ Child mortali~ akr_~ u5mr         NA     NA     NA     NA     NA     NA 
#>  4 Albania  Child mortali~ alb   u5mr        375.   375.   375.   375.   375.   375.
#>  5 Algeria  Child mortali~ dza   u5mr        460.   460.   460.   460.   460.   460.
#>  6 America~ Child mortali~ asm   u5mr         NA     NA     NA     NA     NA     NA 
#>  7 Andorra  Child mortali~ and   u5mr         NA     NA     NA     NA     NA     NA 
#>  8 Angola   Child mortali~ ago   u5mr        486.   486.   486.   486.   486.   486.
#>  9 Anguilla Child mortali~ aia   u5mr         NA     NA     NA     NA     NA     NA 
#> 10 Antigua~ Child mortali~ atg   u5mr        474.   470.   466.   462.   458.   455.
#> # ... with 265 more rows, and 295 more variables: `1806` <dbl>, `1807` <dbl>

Upvotes: 4

James
James

Reputation: 66834

Despite the URL, the GitHub server doesn't actually serve the file unless raw=true is passed in as a parameter.

Use this URL instead: https://github.com/Gapminder-Indicators/u5mr/blob/master/u5mr-by-gapminder.xlsx?raw=true

Upvotes: 4

Related Questions