Vivian

Reputation: 309

How can I read a CSV more quickly in R?

I have to read many CSV files (each more than 120 MB). I use a for loop, but it is very, very slow. How can I read them more quickly?

My code:

H = data.frame()
for (i in 201:225) {
    for (j in 1996:2007) {
        filename = paste("D:/Hannah/CD/CD.R", i, "_cd", j, ".csv", sep = "")
        x = read.csv(filename, stringsAsFactors = FALSE)
        I = c("051", "041", "044", "54", "V0262")
        temp = x[(x$A_1 %in% I) | (x$A_2 %in% I) | (x$A_3 %in% I), ]
        H = rbind(H, temp)
    }
}

Each file has the same structure, like this:

> str(x)
'data.frame':   417691 obs. of  37 variables:
$ YM: int  199604 199612 199612 199612 199606 199606 199609 199601 ...
$ A_TYPE: int  1 1 1 1 1 1 1 1 1 1 ...
$ HOSP: chr  "dd0516ed3e" "c53d67027e" ...
$ A_DATE: int  19960505 19970116 19970108  ...
$ C_TYPE: int  19 9 1 1 2 9 9 1 1 1 ...
$ S_NO : int  142 37974 4580 4579 833 6846 2272 667 447 211 ...
$ C_ITEM_1 : chr  "P2" "P3" "A2"...
$ C_ITEM_2 : chr  "R6" "I3" ""...
$ C_ITEM_3 : chr  "W2" "" "A2"...
$ C_ITEM_4 : chr  "Y1" "O3" ""...
$ F_TYPE: chr  "40" "02" "02" "02" ...
$ F_DATE : int  19960415 19961223 19961227  ...
$ T_END_DATE: int  NA NA NA  ...
$ ID_B : int  19630526 19630526 19630526  ...
$ ID : chr  "fff" "fac" "eab"...
$ CAR_NO : chr  "B4" "B5" "C1" "B6" ...
$ GE_KI: int  4 4 4 4 4 4 4 4 4 4 ...
$ PT_N : chr  "H10" "A10" "D10" "D10" ...
$ A_1  : chr  "0521" "7948" "A310" "A312" ...
$ A_2  : chr  "05235" "5354" "" "" ...
$ A_3  : chr  "" "" "" "" ...
$ I_O_CE: chr  "5210" "" "" "" ...
$ DR_DAY : int  0 7 3 3 0 0 3 3 3 3 ...
$ M_TYPE: int  2 0 0 0 2 2 0 0 0 0 ...

........

Upvotes: 3

Views: 995

Answers (3)

vaettchen

Reputation: 7659

This may not be the most efficient or most elegant approach, but here is what I would do, based on some assumptions where information is missing; in particular, I can't do any testing:

Make sure that RSQLite is installed (sqldf could be an option if you have enough memory, but personally I prefer having a "real" database that I can also access with other tools).

# make sqlite available
library( RSQLite )
db <- dbConnect( dbDriver("SQLite"), dbname = "hannah.sqlite" )

# create a vector with your filenames
filenames <- NULL
for (i in 201:225)
{
    for ( j in 1996:2007 )
    {
        fname <- paste( "D:/Hannah/CD/CD.R", i, "_cd", j, ".csv", sep="" ) 
        filenames <- c( filenames, fname )
    }
}

# extract the DB structure, create empty table
# (note: "all" is a reserved word in SQL, so the table is named "alldata")
x <- read.csv( filenames[1], stringsAsFactors = FALSE, nrows = 1 )
dbWriteTable( db, "alldata", x, row.names = FALSE )
dbGetQuery( db, "DELETE FROM alldata" )

# a small table for your selection criteria (build in flexibility for the future);
# the column must be named "I" so the subqueries below can reference it
I <- data.frame( I = c( "051", "041", "044", "54", "V0262" ),
                 stringsAsFactors = FALSE )
dbWriteTable( db, "crit", I, row.names = FALSE )

# move your 300 .csv files into that table
# (you could probably do that better with the sqlite CLI, but more info would be needed)
for( f in filenames )
{
    x <- read.csv( f, stringsAsFactors = FALSE )
    dbWriteTable( db, "alldata", x, append = TRUE, row.names = FALSE )
}
}

# now you can extract the subset in one go
extract <- dbGetQuery( db, "SELECT * FROM alldata
                       WHERE A_1 IN ( SELECT I FROM crit ) OR
                             A_2 IN ( SELECT I FROM crit ) OR
                             A_3 IN ( SELECT I FROM crit )" )

This is not tested but should work (if not, tell me where it stops), and it should be faster and not run into memory problems. But again, without real data, no real solution!

Upvotes: 2

beginneR

Reputation: 3291

You could also use the function fread() from the data.table package. It's pretty fast compared to read.csv. Also, try to just loop over list.files().
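A minimal sketch of what that could look like, assuming the files all live in D:/Hannah/CD and reusing the filter codes from the question:

# minimal sketch: fread() every CSV in the directory, filter, bind once
library( data.table )

files <- list.files( "D:/Hannah/CD", pattern = "\\.csv$", full.names = TRUE )
I <- c( "051", "041", "044", "54", "V0262" )

H <- rbindlist( lapply( files, function( f ) {
    x <- fread( f )                               # much faster than read.csv
    x[ A_1 %in% I | A_2 %in% I | A_3 %in% I ]     # keep only the matching rows
} ) )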

Upvotes: 2

Paul Hiemstra

Reputation: 60984

I think the big performance problem here is that you iteratively grow the H object. Each time the object grows, R has to allocate a new, larger block of memory and copy the data over, which takes quite a long time. A simple fix is to preallocate H to the correct number of rows. If the number of rows is not known beforehand, you can preallocate a good amount and resize as needed, as sketched below.
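A rough sketch of the preallocation idea, slotted into the question's loop (the row count n_guess is a made-up, generous upper bound, and x is assumed to already hold the first file so the template has the right column types):

# rough sketch of preallocation; n_guess is an assumed upper bound on the result
n_guess <- 1e6
H <- x[ rep( 1, n_guess ), ]      # template with the right columns and types
filled <- 0

# inside the question's loop, instead of H = rbind(H, temp):
H[ (filled + 1):(filled + nrow(temp)), ] <- temp
filled <- filled + nrow(temp)

# after the loop, drop the unused rows:
H <- H[ seq_len( filled ), ]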

Alternatively, the following approach does not suffer from the problem I described above:

list_of_files = list.files('dir_where_files_are', pattern = '\\.csv$', full.names = TRUE)
big_data_frame = do.call('rbind', lapply(list_of_files, read.csv, stringsAsFactors = FALSE))

Upvotes: 4
