Reputation: 309
I have to read many CSV files (each more than 120 MB). I use a for loop, but it is very, very slow. How can I read CSV files more quickly?
My code:
H = data.frame()
for (i in 201:225) {
  for (j in 1996:2007) {
    filename = paste("D:/Hannah/CD/CD.R", i, "_cd", j, ".csv", sep = "")
    x = read.csv(filename, stringsAsFactors = FALSE)
    I = c("051", "041", "044", "54", "V0262")
    temp = x[(x$A_1 %in% I) | (x$A_2 %in% I) | (x$A_3 %in% I), ]
    H = rbind(H, temp)
  }
}
Each file has the same structure, like this:
> str(x)
'data.frame': 417691 obs. of 37 variables:
$ YM: int 199604 199612 199612 199612 199606 199606 199609 199601 ...
$ A_TYPE: int 1 1 1 1 1 1 1 1 1 1 ...
$ HOSP: chr "dd0516ed3e" "c53d67027e" ...
$ A_DATE: int 19960505 19970116 19970108 ...
$ C_TYPE: int 19 9 1 1 2 9 9 1 1 1 ...
$ S_NO : int 142 37974 4580 4579 833 6846 2272 667 447 211 ...
$ C_ITEM_1 : chr "P2" "P3" "A2"...
$ C_ITEM_2 : chr "R6" "I3" ""...
$ C_ITEM_3 : chr "W2" "" "A2"...
$ C_ITEM_4 : chr "Y1" "O3" ""...
$ F_TYPE: chr "40" "02" "02" "02" ...
$ F_DATE : int 19960415 19961223 19961227 ...
$ T_END_DATE: int NA NA NA ...
$ ID_B : int 19630526 19630526 19630526 ...
$ ID : chr "fff" "fac" "eab"...
$ CAR_NO : chr "B4" "B5" "C1" "B6" ...
$ GE_KI: int 4 4 4 4 4 4 4 4 4 4 ...
$ PT_N : chr "H10" "A10" "D10" "D10" ...
$ A_1 : chr "0521" "7948" "A310" "A312" ...
$ A_2 : chr "05235" "5354" "" "" ...
$ A_3 : chr "" "" "" "" ...
$ I_O_CE: chr "5210" "" "" "" ...
$ DR_DAY : int 0 7 3 3 0 0 3 3 3 3 ...
$ M_TYPE: int 2 0 0 0 2 2 0 0 0 0 ...
........
Upvotes: 3
Views: 995
Reputation: 7659
This may not be the most efficient or most elegant approach, but here is what I would do, based on some assumptions where more info is missing; in particular, I can't do any testing:
Make sure that RSQLite is installed (sqldf could be an option if you have enough memory, but personally I prefer having a "real" database that I can also access with other tools).
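If it is not installed yet, a one-liner like this should do it (my addition, not part of the answer):
install.packages( "RSQLite" )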
# make sqlite available
library( RSQLite )
db <- dbConnect( dbDriver("SQLite"), dbname = "hannah.sqlite" )
# create a vector with your filenames
filenames <- NULL
for (i in 201:225)
{
for ( j in 1996:2007 )
{
fname <- paste( "D:/Hannah/CD/CD.R", i, "_cd", j, ".csv", sep="" )
filenames <- c( filenames, fname )
}
}
# extract the DB structure, create empty table
x <- read.csv( filenames[1], stringsAsFactors = FALSE, nrows = 1 )
dbWriteTable( db, "all", x, row.names = FALSE )
# "all" is a reserved word in SQLite, so quote it in hand-written SQL
dbGetQuery( db, 'DELETE FROM "all"' )
# a small table for your selection criteria (build in flexibility for the future)
I <- data.frame( I = c( "051", "041", "044", "54", "V0262" ), stringsAsFactors = FALSE )
dbWriteTable( db, "crit", I, row.names = FALSE )
# move your 300 .csv files into that table
# (you probably do that better using the sqlite CLI but more info would be needed)
for( f in filenames )
{
x <- read.csv( f, stringsAsFactors = FALSE )
dbWriteTable( db, "all", x, append = TRUE, row.names = FALSE )
}
# now you can extract the subset in one go
extract <- dbGetQuery( db, 'SELECT * FROM "all"
                            WHERE A_1 IN ( SELECT I FROM crit ) OR
                                  A_2 IN ( SELECT I FROM crit ) OR
                                  A_3 IN ( SELECT I FROM crit )' )
This is not tested but should work (if not, tell me where it stops) and it should be faster and not run into memory problems. But again, without real data no real solution!
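As a small follow-up (my addition, not part of the answer above): once the extract data frame is in R you can close the connection; the hannah.sqlite file on disk keeps the imported table for later sessions.
# close the connection; the .sqlite file keeps the data for reuse
dbDisconnect( db )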
Upvotes: 2
Reputation: 3291
You could also use the function fread() from the data.table package. It's pretty fast compared to read.csv. Also, try to just loop over list.files(); a rough sketch of both ideas follows.
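A minimal sketch of that approach; the directory path, the file pattern, and the A_1/A_2/A_3 filter are taken from the question, not from this answer:
# read every .csv in the directory with fread() and keep only the matching rows
library(data.table)
files <- list.files("D:/Hannah/CD", pattern = "\\.csv$", full.names = TRUE)
I <- c("051", "041", "044", "54", "V0262")
read_one <- function(f) {
  x <- fread(f)                              # much faster than read.csv
  x[A_1 %in% I | A_2 %in% I | A_3 %in% I]    # keep only the rows of interest
}
H <- rbindlist(lapply(files, read_one))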
Upvotes: 2
Reputation: 60984
I think the big performance problem here is that you iteratively grow the H object. Each time the object grows, more memory has to be allocated and the existing data copied over, and this takes quite long. A simple fix is to preallocate H to the correct number of rows. If the number of rows is not known beforehand, you can preallocate a good amount and resize as needed; see the sketch below.
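A sketch of that preallocation idea, reusing the loop, path, and filter from the question; the size guess n_guess is made up for illustration and not something given in the original answer:
n_guess <- 500000                     # rough over-estimate of the matching rows
I <- c("051", "041", "044", "54", "V0262")
H <- NULL
filled <- 0
for (i in 201:225) {
  for (j in 1996:2007) {
    filename <- paste("D:/Hannah/CD/CD.R", i, "_cd", j, ".csv", sep = "")
    x <- read.csv(filename, stringsAsFactors = FALSE)
    if (is.null(H)) H <- x[rep(NA_integer_, n_guess), ]  # preallocate once, same columns as x
    temp <- x[(x$A_1 %in% I) | (x$A_2 %in% I) | (x$A_3 %in% I), ]
    if (nrow(temp) > 0) {
      H[(filled + 1):(filled + nrow(temp)), ] <- temp    # fill in place instead of rbind
      filled <- filled + nrow(temp)
    }
  }
}
H <- H[seq_len(filled), ]             # drop the unused preallocated rows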
Alternatively, the following approach does not suffer from the problem I describe above:
list_of_files = list.files('dir_where_files_are', pattern = '\\.csv$', full.names = TRUE)
big_data_frame = do.call('rbind', lapply(list_of_files, read.csv, stringsAsFactors = FALSE))
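If keeping all 300 unfiltered files in memory is a concern, a variant (my addition, not part of this answer) that applies the question's filter to each file before combining:
I <- c("051", "041", "044", "54", "V0262")
read_and_filter <- function(f) {
  x <- read.csv(f, stringsAsFactors = FALSE)
  x[(x$A_1 %in% I) | (x$A_2 %in% I) | (x$A_3 %in% I), ]  # keep only matching rows
}
big_data_frame = do.call('rbind', lapply(list_of_files, read_and_filter))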
Upvotes: 4