mlegge

Reputation: 6913

Suppress wc -l in data.table's fread

I am reading a large file (~30 GB) in chunks and have noticed that most of the time is taken by performing a line count on the entire file.

Read 500000 rows and 49 (of 49) columns from 28.250 GB file in 00:01:09
   4.510s (  7%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
  53.890s ( 79%) Count rows (wc -l)
   0.010s (  0%) Column type detection (first, middle and last 5 rows)
   0.120s (  0%) Allocation of 500000x49 result (xMB) in RAM
   9.780s ( 14%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.060s (  0%) Changing na.strings to NA
  68.370s        Total

Is it possible to specify that fread not do a full row count every time I read a chunk, or is this a necessary step?

EDIT: Here is the exact command I am running:

fread(pfile, skip = 5E6, nrows = 5E5, sep = "\t", colClasses = rpColClasses, na.strings = c("NA", "N/A", "NULL"), header = FALSE, verbose = TRUE)
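
Roughly, the surrounding loop looks like this (a sketch; pfile and rpColClasses are as above, and the ten offsets are purely illustrative):

library(data.table)

offsets <- seq(0, by = 5e5, length.out = 10)  # illustrative chunk starts
chunks <- lapply(offsets, function(off) {
  # Each call re-counts the rows of the entire 30 GB file (the
  # "Count rows (wc -l)" step above) before reading its 5e5 rows.
  fread(pfile, skip = off, nrows = 5e5, sep = "\t",
        colClasses = rpColClasses,
        na.strings = c("NA", "N/A", "NULL"), header = FALSE)
})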

Upvotes: 4

Views: 266

Answers (1)

Jacob H

Reputation: 4513

I'm not sure you can "turn off" the wc -l step in fread. That notwithstanding, I do have two answers for you.

Answer 1: Use the Unix command split to break the large data set into chunks before calling fread. I find that knowing a bit of Unix goes a long way when handling big data sets (i.e. data that does not fit into RAM).

split -l 1000000 myfile.csv chunk_  # 1,000,000 lines per chunk; -l keeps rows intact (splitting by bytes with -b can cut a row in half)
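
The chunks can then be read back with fread and stacked; a minimal sketch (the chunk_ prefix matches the split call above, and the sep/na.strings values are borrowed from the question):

library(data.table)

files <- list.files(pattern = "^chunk_")  # pieces written by split

# Each piece is small, so the per-file row count fread performs is cheap.
dt <- rbindlist(lapply(files, fread, sep = "\t", header = FALSE,
                       na.strings = c("NA", "N/A", "NULL")))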

Answer 2: Use connections. Unfortunately this approach does not work with fread, but it does with base R's readers. See my earlier answer for what I mean by reading via connections: Strategies for reading in CSV files in pieces?
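
For completeness, a minimal sketch of the connection approach with base read.table (the file name and chunk size are placeholders):

con <- file("myfile.tsv", open = "r")
repeat {
  chunk <- tryCatch(
    # The connection remembers its position, so each call reads the
    # next 5e5 lines; no pass over the whole file is ever made.
    read.table(con, sep = "\t", header = FALSE, nrows = 5e5,
               na.strings = c("NA", "N/A", "NULL")),
    error = function(e) NULL  # connection exhausted
  )
  if (is.null(chunk)) break
  # ... process chunk here ...
  if (nrow(chunk) < 5e5) break  # last, short chunk
}
close(con)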

Upvotes: 2
