Reputation: 792
I'm having trouble importing a file into R. The file was obtained from this website: https://report.nih.gov/award/index.cfm, where I clicked "Import Table" and downloaded a .xls file for the year 1992.
This image might help describe how I retrieved the data
Here's what I've tried typing into the console, along with the results:
Input:
> library('readxl')
> data1992 <- read_excel("1992.xls")
Output:
Not an excel file
Error in eval(substitute(expr), envir, enclos) :
Failed to open /home/chrx/Documents/NIH Funding Awards, 1992 - 2016/1992.xls
Input:
> data1992 <- read.csv ("1992.xls", sep ="\t")
Output:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
I'm not sure whether or not this is relevant, but I'm using GalliumOS (linux). Because I'm using Linux, Excel isn't installed on my computer. LibreOffice is.
Upvotes: 1
Views: 4004
Reputation: 19025
Why bother with getting the data in and out of a .csv if it's right there on the web page for you to scrape?
# note the query parameters in the url when you apply a filter, e.g. fy=
url <- 'http://report.nih.gov/award/index.cfm?fy=1992'
library('rvest')
library('magrittr')
library('dplyr')
df <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="orgtable"]') %>%
html_table()%>%
extract2(1) %>%
mutate(Funding = as.numeric(gsub('[^0-9.]','',Funding)))
head(df)
returns
Organization City State Country Awards Funding
1 A.T. STILL UNIVERSITY OF HEALTH SCIENCES KIRKSVILLE MO UNITED STATES 3 356221
2 AAC ASSOCIATES, INC. VIENNA VA UNITED STATES 10 1097158
3 AARON DIAMOND AIDS RESEARCH CENTER NEW YORK NY UNITED STATES 3 629946
4 ABBOTT LABORATORIES NORTH CHICAGO IL UNITED STATES 4 1757241
5 ABIOMED, INC. DANVERS MA UNITED STATES 6 2161146
6 ABRATECH CORPORATION SAUSALITO CA UNITED STATES 1 450411
If you need to loop through years 1992 to present, or something similar, this programmatic approach will save you a lot of time versus handling a bunch of flat files.
Upvotes: 3
Reputation: 23241
This works for me
library(gdata)
dat1 <- read.xls("1992.xls")
If you're on 32-bit Windows this will also work:
require(RODBC)
dat1 <- odbcConnectExcel("1992.xls")
For several more options that rely on rJava
-based packages like xlsx
you can check out this link.
As someone mentioned in the comments it's also easy to save the file as a .csv and read it in that way. This will save you the trouble of dealing with the effects of strange formatting or metadata on your imported file:
dat1 <- read.csv("1992.csv")
head(dat1)
ORGANIZATION CITY STATE COUNTRY AWARDS FUNDING 1 A.T. STILL UNIVERSITY OF HEALTH SCIENCES KIRKSVILLE MO UNITED STATES 3 $356,221 2 AAC ASSOCIATES, INC. VIENNA VA UNITED STATES 10 $1,097,158 3 AARON DIAMOND AIDS RESEARCH CENTER NEW YORK NY UNITED STATES 3 $629,946 4 ABBOTT LABORATORIES NORTH CHICAGO IL UNITED STATES 4 $1,757,241 5 ABIOMED, INC. DANVERS MA UNITED STATES 6 $2,161,146 6 ABRATECH CORPORATION SAUSALITO CA UNITED STATES 1 $450,411
Converting to .csv is also usually the fastest way in my opinion (though this is only an issue with Big Data).
Upvotes: 2