vinash85

Reputation: 431

R fread data.table inconsistent speed

I am seeing inconsistent speed from data.table's fread function. I have two files of ~8 GB each. The contents of the files are (almost) the same, yet the times to read them are strangely different.

 control.major  <-  fread("control.major.gff")$V6
 Read 19.8% of 98100000 rows
 Read 98100000 rows and 10 (of 10) columns from 7.947 GB file in 02:06:58
 control.minor  <-  fread("control.minor.gff")$V6  
 Read 98100000 rows and 10 (of 10) columns from 7.947 GB file in 00:03:15

I need to read the 6th column of each file, which is entirely numeric. Initially I found that fread was faster than

 scan(pipe("cut -f6  SNP.major.gff"),  sep="\n")

because the cut command was taking an awfully long time.

Why is fread's behavior inconsistent? Is there a faster way to read one column?

Upvotes: 5

Views: 1927

Answers (2)

Jacob H

Reputation: 4513

I've had a similar problem: the first time I ran fread it was very slow, but successive runs were much faster. In my case this was because I was working on a computer in my university's computer lab, so the data was not stored locally on my machine but on a network drive. Most of the time spent running fread was therefore spent transferring the data across the network and into my local working memory. This was corroborated by the fact that when I timed my code on the first run, user time + sys. time << elapsed time.

When you load the data once, however, it is temporarily in your working memory, i.e. RAM. Successive calls to fread with the same data are therefore much faster.
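To check whether a slow read is dominated by waiting on I/O rather than by parsing, the comparison described above can be sketched with system.time. This is only an illustration: the temporary file and the helper cpu_vs_elapsed are made up for the example, and because the file has just been written locally it is probably already cached, so the first run will not be slow here. Note also that fread can use several threads, so user time may exceed elapsed time on a fast local read.

```r
library(data.table)

# Hypothetical small stand-in file; replace tmp with the path to the real file.
tmp <- tempfile(fileext = ".tsv")
fwrite(data.table(a = 1:1e5, b = rnorm(1e5)), tmp, sep = "\t")

# Time two consecutive reads of the same file.
t1 <- system.time(x1 <- fread(tmp))
t2 <- system.time(x2 <- fread(tmp))

# Illustrative helper: CPU time (user + sys) next to wall-clock (elapsed) time.
# If cpu is far below elapsed, most of the time was spent waiting, e.g. on disk or network.
cpu_vs_elapsed <- function(t) {
  c(cpu = t[["user.self"]] + t[["sys.self"]], elapsed = t[["elapsed"]])
}

print(rbind(first = cpu_vs_elapsed(t1), second = cpu_vs_elapsed(t2)))
```

On a network drive one would expect the "first" row to show elapsed far above cpu, and the "second" row to show the two much closer together.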

Upvotes: 5

Matt Dowle

Reputation: 59602

Why did it take 2 hours to load?

Please run it again with verbose=TRUE and include the full output in the question. Maybe the operating system put it in the background while something else ran, or something like that. Did your laptop suspend or hibernate in that time? Please also include the output of sessionInfo().
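A minimal sketch of gathering that diagnostic output, assuming a small temporary file in place of the real one (the file and its contents are invented for the example):

```r
library(data.table)

# Hypothetical stand-in file; use the path to the real .gff file instead.
tmp <- tempfile(fileext = ".tsv")
fwrite(data.table(x = 1:5, y = letters[1:5]), tmp, sep = "\t")

# verbose = TRUE makes fread print details of what it detected and how long each stage took.
dt <- fread(tmp, verbose = TRUE)

# R version, platform, and loaded package versions.
print(sessionInfo())
```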

Is there a faster way to read one column?

Yes. You can pass a vector of column names or positions to the select argument. See ?fread. But I suspect the two issues are unrelated.
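For instance, a minimal sketch of reading only the 6th column, assuming a tab-separated file with 10 columns and no header. The temporary file here is a made-up stand-in for the real .gff file:

```r
library(data.table)

# Hypothetical stand-in for the real file: 3 rows, 10 tab-separated columns, no header.
tmp <- tempfile(fileext = ".gff")
demo <- as.data.table(matrix(1:30, nrow = 3, ncol = 10))
fwrite(demo, tmp, sep = "\t", col.names = FALSE)

# select = 6 asks fread to return only the 6th column;
# [[1]] extracts that single column as a plain vector.
col6 <- fread(tmp, sep = "\t", header = FALSE, select = 6)[[1]]
print(col6)
```

With the real data this would be something like fread("control.major.gff", select = 6)[[1]], which avoids materialising the other nine columns in R.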

Upvotes: 6
