
Reputation: 119

Reading Excel with R

I am trying to decide whether to read Excel files directly from R or to convert them to CSV first. I have researched the various options for reading Excel files, and found that reading Excel directly has its downsides, such as conversion issues with date and numeric column types.

XLConnect - depends on Java
read.xlsx - slow for large data sets
read.xlsx2 - fast, but requires colClasses to specify the desired column classes
ODBC - may have conversion issues
gdata - depends on Perl

I am looking for a solution that will be fast enough for at least a million rows, with minimal data-conversion issues. Any suggestions?

EDIT

So I have finally decided to convert to CSV and read the CSV file instead, but now I have to figure out the best way to read a large CSV file (with at least 1 million rows).

I found the read.csv.ffdf function (from the ff package), but it does not let me set my own colClasses. Specifically:

setAs("character", "myDate", function(from) { classFun(from) })
colClassesffdf <- c("numeric", "character", "myDate", "numeric", "numeric", "myDate")
z <- read.csv.ffdf(file = pathCsv, colClasses = colClassesffdf)

This does not work, and I get the following error:

Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, : vmode 'list' not implemented

I am also aware of the RSQLite and ODBC functionality but do not wish to use them. Is there a solution to the above error, or another way around this?

Upvotes: 2

Views: 1674

Answers (3)

Jack Wasey

Reputation: 3440

Since this question was asked, Hadley Wickham has released the R package readxl, which wraps C and C++ libraries to read .xls and .xlsx files, respectively. It is a big improvement on the previous possibilities, but not without problems. It is fast and simple, but if you have messy data, you will have to do some work whichever method you choose. Going down the .csv route isn't a terrible idea, but it does introduce a manual step into your analysis, and relies on whichever version of Excel you happen to use producing consistent CSV output.
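As a minimal sketch of the readxl route (the tiny stand-in spreadsheet and its columns are made up for illustration; writexl is used only to create a test file), read_excel lets you declare column types up front, much like colClasses:

```r
# install.packages(c("writexl", "readxl"))  # if not already installed
library(writexl)
library(readxl)

# Tiny stand-in spreadsheet; a real .xlsx file works the same way
path <- tempfile(fileext = ".xlsx")
write_xlsx(data.frame(x = c(1.5, 2.5), label = c("a", "b")), path)

# Column types are declared up front; "date" is also a valid col_type,
# so no setAs() workaround is needed as with read.csv.ffdf
z <- read_excel(path, col_types = c("numeric", "text"))
```

Valid col_types values include "numeric", "text", "date", "logical", "guess", and "skip".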

Upvotes: 5

Greg Snow

Reputation: 49640

If you want speed and large data, then you might consider converting your Excel file(s) to a database format, then connecting R to the database.

A quick Google search turns up several tools for converting Excel files to SQLite databases; you could then use the RSQLite or sqldf package to read the data into R.

Or use the RODBC package if you convert to a database that works with ODBC. Field-conversion problems should be fewer if you do the conversion to the database correctly.
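To sketch the RSQLite route (the in-memory database and the table name excel_data stand in for a file produced by an Excel-to-SQLite converter), reading the converted data back into R is only a few lines:

```r
# install.packages("RSQLite")  # if not already installed
library(DBI)
library(RSQLite)

# In-memory database stands in for the converted spreadsheet file;
# a real conversion would give you a .sqlite file to pass to dbConnect()
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "excel_data", data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)))

# SQL lets you filter before the data ever reaches R's memory
z <- dbGetQuery(con, "SELECT * FROM excel_data WHERE value > 3")
dbDisconnect(con)
```

One advantage of this approach for million-row data is that you can push filtering and aggregation into the database rather than loading everything into R.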

Upvotes: 1

Drew Steen

Reputation: 16607

All the solutions you mentioned will work - but if manually converting to .csv and reading with read.csv is an option, I'd recommend that. In my experience it is faster and easier to get right.
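To sketch that route (the column layout and date format here are assumptions based on the question), plain read.csv accepts the setClass()/setAs()/colClasses combination that read.csv.ffdf rejects:

```r
library(methods)  # for setClass() / setAs()

# A custom date class: read.csv applies the setAs() coercion per column
setClass("myDate")
setAs("character", "myDate", function(from) as.Date(from, format = "%Y-%m-%d"))

# Small stand-in file; a real million-row file is read the same way
path <- tempfile(fileext = ".csv")
writeLines(c("x,label,when",
             "1.5,a,2013-01-31",
             "2.5,b,2013-02-28"), path)

z <- read.csv(path, colClasses = c("numeric", "character", "myDate"))
```

After reading, z$when is a proper Date column, with no post-processing step needed.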

Upvotes: 4
