Reputation: 71
I am supposed to read a big CSV file (5.4GB, with 7m lines and 205 columns) in R. I have successfully read it using data.table::fread(). But I want to know: is it possible to read it using the basic read.csv()?
I tried brute force first, but my 16GB of RAM cannot hold that. Then I tried the 'divide-and-conquer' (chunking) strategy below, but it still didn't work. How should I do this?
dt1 <- read.csv('./ss13hus.csv', header = FALSE, nrows = 721900, skip = 1)
print(paste(1, 'th chunk completed'))
system.time(
  for (i in 1:9) {
    tmp <- read.csv('./ss13hus.csv', header = FALSE, nrows = 721900, skip = i * 721900 + 1)
    dt1 <- rbind(dt1, tmp)
    print(paste(i + 1, 'th chunk completed'))
  }
)
Also, I want to know how fread() works that it can read all the data at once so efficiently, in terms of both memory and time.
Upvotes: 0
Views: 500
Reputation: 33940
Your issue is not fread(); it's the memory bloat caused by not defining colClasses for all your (205) columns. But be aware that trying to read all 5.4GB into 16GB of RAM is really pushing it in the first place: you almost surely won't be able to hold that entire dataset in memory, and even if you could, you'll blow out memory whenever you try to process it. So your approach is not going to fly; you seriously have to decide which subset you can handle, i.e. which fields you absolutely need to get started:
Define colClasses for your 205 columns: 'integer' for integer columns, 'numeric' for double columns, 'logical' for boolean columns, 'factor' for factor columns. Otherwise things get stored very inefficiently (e.g. millions of strings are very wasteful), and the result can easily be 5-100x larger than the raw file.
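For example, a minimal sketch (the column names and types below are placeholders, not the real ss13hus.csv schema; named colClasses entries are matched against the header, see ?read.csv):

# Placeholder names/types -- in real code, cover the 205 columns you keep.
col_types <- c(SERIALNO = 'numeric', ST = 'factor',
               WGTP = 'integer', VALP = 'numeric')
df <- read.csv('./ss13hus.csv', colClasses = col_types,
               stringsAsFactors = FALSE)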
If you can't fit all 7m rows x 205 columns (which you almost surely can't), then you'll need to aggressively reduce memory by doing some or all of the following (a combined sketch follows the list):
- Read in chunks (use the skip and nrows arguments, and search SO for questions on fread in chunks).
- Read in only the columns you need with the select/drop arguments (specify vectors of column names to keep or drop).
- Make sure stringsAsFactors = FALSE; it's a notoriously bad default in R which causes no end of memory grief.
- For date/datetime fields, read them in as character for now and convert them later with the fasttime package or standard base functions.
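Putting the list together, here is a rough sketch (not tested against the real file; the kept column names are hypothetical, and the chunk size is taken from your question). It reads the header once, pulls only a few columns per chunk with fread, and binds everything once at the end with rbindlist, which is far cheaper than calling rbind() inside the loop:

library(data.table)

hdr  <- names(fread('./ss13hus.csv', nrows = 0))  # header line only
keep <- c('SERIALNO', 'ST', 'VALP')               # hypothetical: the fields you need
keep_idx <- match(keep, hdr)                      # select by position; chunks have no header

chunk_size <- 721900L
chunks <- vector('list', 10L)
for (i in 0:9) {
  chunks[[i + 1L]] <- fread('./ss13hus.csv', header = FALSE,
                            skip = i * chunk_size + 1L,  # +1 skips the header row
                            nrows = chunk_size,
                            select = keep_idx)
}
dt <- rbindlist(chunks)   # one bind at the end, not rbind() per iteration
setnames(dt, keep)

# If a datetime column was kept as character, convert it afterwards, e.g.:
# dt[, mydate := fasttime::fastPOSIXct(mydate)]   # 'mydate' is hypothetical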
Please see ?fread and the data.table doc for the syntax for the above. If you encounter a specific error, post a snippet of say 2 lines of data (head(data)), your code and the error.
Upvotes: 4