Hengcheng Zhu

Reputation: 71

Read a huge csv file using `read.csv` with a divide-and-conquer strategy?

I am supposed to read a big csv file (5.4 GB, with 7m lines and 205 columns) in R. I have successfully read it using data.table::fread(), but I want to know whether it is possible to read it with the base read.csv().

I tried brute force first, but my 16 GB of RAM cannot hold the whole file. Then I tried the 'divide-and-conquer' (chunking) strategy below, but it still didn't work. How should I do this?

dt1 <- read.csv('./ss13hus.csv', header = FALSE, nrows = 721900, skip = 1)  # first chunk (skip the header row)
print(paste(1, 'th chunk completed'))
system.time(
  for (i in 1:9) {
    tmp <- read.csv('./ss13hus.csv', header = FALSE, nrows = 721900, skip = i * 721900 + 1)  # next 721900 rows
    dt1 <- rbind(dt1, tmp)  # append the chunk to what has been read so far
    print(paste(i + 1, 'th chunk completed'))
  }
)

Also, I want to know how fread() works such that it can read all the data at once, very efficiently in terms of both memory and time.

Upvotes: 0

Views: 500

Answers (1)

smci

Reputation: 33940

Your issue is not fread(); it's the memory bloat caused by not defining colClasses for all your (205) columns. But be aware that trying to read all 5.4 GB into 16 GB of RAM is really pushing it in the first place: you almost surely won't be able to hold the whole dataset in memory, and even if you could, you'll blow out memory whenever you try to process it. So your approach is not going to fly; you seriously have to decide which subset you can handle, i.e. which fields you absolutely need to get started:

  • Define colClasses for your 205 columns: 'integer' for integer columns, 'numeric' for double columns, 'logical' for boolean columns, 'factor' for factor columns. Otherwise things get stored very inefficiently (e.g. millions of strings are very wasteful), and the result can easily be 5-100x larger than the raw file. See the first sketch after this list.

  • If you can't fit all 7m rows x 205 columns (which you almost surely can't), then you'll need to aggressively reduce memory by doing some or all of the following:

    • read in and process chunks (of rows): use the skip and nrows arguments, and search SO for questions on fread in chunks; see the chunked-read sketch after this list
    • filter out all unneeded rows (e.g. you may be able to do some crude processing to form a row-index of the subset of rows you care about, and import that much smaller set later)
    • drop all unneeded columns (use fread's select/drop arguments to specify vectors of column names to keep or drop)
  • Make sure to set stringsAsFactors = FALSE; it's a notoriously bad default in R which causes no end of memory grief.

  • Date/datetime fields are currently read as character, which is bad news for memory usage (millions of unique strings). Either drop the date columns entirely to begin with, or read the data in chunks and convert them with the fasttime package or standard base functions; see the fasttime sketch after this list.
  • Look at the arguments for NA treatment. You might want to drop columns with lots of NAs, or messy unprocessed string fields, for now.
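
As a rough illustration of the colClasses, select/drop, stringsAsFactors and NA points above, a single fread() call could look like the sketch below. The column names and types are made up, since your real ss13hus.csv schema isn't shown here; substitute the names and classes of the columns you actually need.

library(data.table)

# Hypothetical column names/types -- replace with the real ss13hus.csv schema.
my_classes <- c(SERIALNO = 'integer',
                WGTP     = 'integer',
                VALP     = 'numeric',
                ST       = 'character')

dt <- fread('./ss13hus.csv',
            select           = names(my_classes),  # read only these columns
            colClasses       = my_classes,         # no type guessing, no character bloat
            na.strings       = c('', 'NA'),        # treat empty fields as NA
            stringsAsFactors = FALSE)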
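
For the chunked approach, here is a sketch using fread's skip/nrows; the chunk size, chunk count and kept columns are assumptions, not taken from your data. Reduce each chunk immediately and combine once at the end:

library(data.table)

chunk_rows <- 721900
n_chunks   <- 10
header     <- names(fread('./ss13hus.csv', nrows = 1))    # read one row just to get column names
keep_cols  <- header[1:10]                                 # pretend we only need the first 10 columns

pieces <- vector('list', n_chunks)
for (i in seq_len(n_chunks)) {
  chunk <- fread('./ss13hus.csv',
                 header    = FALSE,
                 col.names = header,
                 skip      = (i - 1) * chunk_rows + 1,     # +1 skips the header line
                 nrows     = chunk_rows)
  pieces[[i]] <- chunk[, ..keep_cols]                      # drop unneeded columns right away
}
dt <- rbindlist(pieces)   # one combine at the end, much cheaper than rbind() in a loop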
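
For the date/datetime point, a conversion sketch with fasttime could look like this; the column name MOVEDATE and its 'YYYY-MM-DD hh:mm:ss' format are assumptions:

library(data.table)
library(fasttime)   # fast parsing of ISO-8601-style timestamp strings

# Hypothetical: assumes dt has a character column MOVEDATE like '2013-05-01 00:00:00'.
dt[, MOVEDATE := fastPOSIXct(MOVEDATE, tz = 'UTC')]
# Or, with base R, for plain dates:
# dt[, MOVEDATE := as.Date(MOVEDATE)]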

Please see ?fread and the data.table documentation for the syntax of the above. If you encounter a specific error, post a snippet of, say, 2 lines of data (head(data)), your code and the error.

Upvotes: 4
