GT.
GT.

Reputation: 792

Can't import this excel file into R

I'm having trouble importing a file into R. The file was obtained from this website: https://report.nih.gov/award/index.cfm, where I clicked "Import Table" and downloaded a .xls file for the year 1992.

This image might help describe how I retrieved the data Image describing how I retrieved the data from the NIH database

Here's what I've tried typing into the console, along with the results:

Input:

> library('readxl')
> data1992 <- read_excel("1992.xls")

Output:

Not an excel file
Error in eval(substitute(expr), envir, enclos) : 
  Failed to open /home/chrx/Documents/NIH Funding Awards, 1992 - 2016/1992.xls

Input:

> data1992 <- read.csv ("1992.xls", sep ="\t")

Output:

Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
  more columns than column names

I'm not sure whether or not this is relevant, but I'm using GalliumOS (linux). Because I'm using Linux, Excel isn't installed on my computer. LibreOffice is.

Upvotes: 1

Views: 4004

Answers (2)

C8H10N4O2
C8H10N4O2

Reputation: 19025

Why bother with getting the data in and out of a .csv if it's right there on the web page for you to scrape?

# note the query parameters in the url when you apply a filter, e.g. fy=
url <- 'http://report.nih.gov/award/index.cfm?fy=1992'

library('rvest')
library('magrittr')
library('dplyr')
df <- url %>%
        read_html() %>%
        html_nodes(xpath='//*[@id="orgtable"]') %>%
        html_table()%>% 
        extract2(1) %>%
        mutate(Funding = as.numeric(gsub('[^0-9.]','',Funding)))

head(df)

returns

                              Organization          City State       Country Awards Funding
1 A.T. STILL UNIVERSITY OF HEALTH SCIENCES    KIRKSVILLE    MO UNITED STATES      3  356221
2                     AAC ASSOCIATES, INC.        VIENNA    VA UNITED STATES     10 1097158
3       AARON DIAMOND AIDS RESEARCH CENTER      NEW YORK    NY UNITED STATES      3  629946
4                      ABBOTT LABORATORIES NORTH CHICAGO    IL UNITED STATES      4 1757241
5                            ABIOMED, INC.       DANVERS    MA UNITED STATES      6 2161146
6                     ABRATECH CORPORATION     SAUSALITO    CA UNITED STATES      1  450411

If you need to loop through years 1992 to present, or something similar, this programmatic approach will save you a lot of time versus handling a bunch of flat files.

Upvotes: 3

Hack-R
Hack-R

Reputation: 23241

This works for me

library(gdata)
dat1 <- read.xls("1992.xls")

If you're on 32-bit Windows this will also work:

require(RODBC)
dat1 <- odbcConnectExcel("1992.xls")

For several more options that rely on rJava-based packages like xlsx you can check out this link.

As someone mentioned in the comments it's also easy to save the file as a .csv and read it in that way. This will save you the trouble of dealing with the effects of strange formatting or metadata on your imported file:

dat1 <- read.csv("1992.csv")

head(dat1)
                              ORGANIZATION          CITY STATE       COUNTRY AWARDS     FUNDING
1 A.T. STILL UNIVERSITY OF HEALTH SCIENCES    KIRKSVILLE    MO UNITED STATES      3   $356,221 
2                     AAC ASSOCIATES, INC.        VIENNA    VA UNITED STATES     10 $1,097,158 
3       AARON DIAMOND AIDS RESEARCH CENTER      NEW YORK    NY UNITED STATES      3   $629,946 
4                      ABBOTT LABORATORIES NORTH CHICAGO    IL UNITED STATES      4 $1,757,241 
5                            ABIOMED, INC.       DANVERS    MA UNITED STATES      6 $2,161,146 
6                     ABRATECH CORPORATION     SAUSALITO    CA UNITED STATES      1   $450,411

Converting to .csv is also usually the fastest way in my opinion (though this is only an issue with Big Data).

Upvotes: 2

Related Questions