Reputation: 119
I am trying to decide whether to read Excel files directly into R or convert them to CSV first. I have researched the various options for reading Excel files, and found that reading Excel directly can have downsides, such as conversion issues with date and numeric column types:
XLConnect - dependent on Java
read.xlsx - slow for large data sets
read.xlsx2 - fast, but you need to use the colClasses argument to specify the desired column classes (see the sketch after this list)
ODBC - may have conversion issues
gdata - dependent on Perl
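For reference, this is roughly what the read.xlsx2 approach looks like (a minimal sketch; the file name, sheet index, and class vector are assumptions about the data):
library(xlsx)
# read the first sheet, forcing each column to a known class instead of guessing
df <- read.xlsx2("mydata.xlsx", sheetIndex = 1,
                 colClasses = c("numeric", "character", "Date",
                                "numeric", "numeric", "Date"))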
I am looking for a solution that is fast enough for at least a million rows, with minimal data conversion issues. Any suggestions?
EDIT
So finally I have decided to convert to CSV and then read the CSV file, but now I have to figure out the best way to read a large CSV file (with at least 1 million rows).
I found out about the read.csv.ffdf function in the ff package, but it does not let me set my own column classes. Specifically, with this:
setAs("character", "myDate", function(from) { classFun(from) })  # classFun is my own date parser
colClassesffdf <- c("numeric", "character", "myDate", "numeric", "numeric", "myDate")
z <- read.csv.ffdf(file = pathCsv, colClasses = colClassesffdf)
This does not work, and I get the following error:
Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, : vmode 'list' not implemented
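The same conversion does work with plain read.csv, following the example in ?read.table, so the error seems specific to read.csv.ffdf. A minimal sketch, with the date format standing in for whatever classFun does:
setAs("character", "myDate", function(from) as.Date(from, format = "%d/%m/%Y"))
z2 <- read.csv(pathCsv,
               colClasses = c("numeric", "character", "myDate",
                              "numeric", "numeric", "myDate"))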
I am also aware of the RSQLite and ODBC functionality but do not wish to use them. Is there a solution to the above error, or any other way around this?
Upvotes: 2
Views: 1674
Reputation: 3440
Since this question was asked, Hadley Wickham has released the R package readxl, which wraps C and C++ libraries to read .xls and .xlsx files, respectively. It is a big improvement on the previous options, but not without problems. It is fast and simple, but if you have messy data, you will have to do some work whichever method you choose. Going down the .csv route isn't a terrible idea, but it does introduce a manual step into your analysis, and relies on whichever version of Excel you happen to use producing consistent CSV output.
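A minimal sketch of the readxl approach (the file name and column types are assumptions about your data; see ?read_excel for the accepted type names):
library(readxl)
# col_types overrides readxl's type guessing, column by column
df <- read_excel("mydata.xlsx", sheet = 1,
                 col_types = c("numeric", "text", "date",
                               "numeric", "numeric", "date"))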
Upvotes: 5
Reputation: 49640
If you want speed and large data, then you might consider converting your Excel file(s) to a database format, then connecting R to the database.
A quick Google search shows several tools for converting Excel files to SQLite databases; you could then use the RSQLite or sqldf package to read the data into R.
Or use the RODBC package if you convert to one of the databases that works with ODBC. The field-conversion problems should be fewer if you do the conversion to the database correctly.
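Once the data is in SQLite, reading it from R looks roughly like this (the database, table, and column names are hypothetical):
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "mydata.sqlite")
# let the database do the filtering instead of loading everything into R
z <- dbGetQuery(con, "SELECT * FROM mytable WHERE some_col > 0")
dbDisconnect(con)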
Upvotes: 1
Reputation: 16607
All the solutions you mentioned will work, but if manually converting to .csv and reading with read.csv is an option, I'd recommend that. In my experience it is faster and easier to get right.
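For large files, it helps to tell read.csv what to expect up front. A sketch, where the classes and the row count (a slight overestimate of your million rows) are assumptions about your file:
# colClasses skips per-column type guessing; nrows helps R allocate memory once
df <- read.csv("mydata.csv", nrows = 1.2e6,
               colClasses = c("numeric", "character", "character",
                              "numeric", "numeric", "character"))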
Upvotes: 4