fwzzjd

Reputation: 13

Are there faster ways in data.table fread() to read large gz file into memory?

I am trying to read back a large gz file (about 1.5 GB, with two columns: the 1st column is numeric and the 2nd is character) named CID-Title.gz from my computer's disk (originally, I downloaded this file from here).

I used data.table::fread() as below:

dat <- data.table::fread(file = "CID-Title.gz", showProgress = TRUE, nThread = 16)

How can I make fread() faster when reading this gz file? Or are there other tricks to read large gz files more quickly?

Upvotes: 1

Views: 1660

Answers (1)

Ben Bolker

Reputation: 226182

You can try the vroom package (web page here, CRAN page here). It reads the data into a tidyverse tibble; I don't know whether there is a downstream performance penalty for converting the tibble to a data.table object.

vroom doesn’t stop to actually read all of your data, it simply indexes where each record is located so it can be read later. The vectors returned use the Altrep framework to lazily load the data on-demand when it is accessed, so you only pay for what you use. This lazy access is done automatically, so no changes to your R data-manipulation code are needed.

vroom also uses multiple threads for indexing, materializing non-character columns, and when writing to further improve performance.

vroom supports reading zip, gz, bz2 and xz compressed files automatically, just pass the filename of the compressed file to vroom.
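Putting this together, here is a minimal sketch. Since the real CID-Title.gz is not at hand, this writes a tiny stand-in gz file first; the column names (`CID`, `Title`), the tab delimiter, and the absence of a header row are assumptions based on the file description (numeric ID plus character title), not confirmed facts about the actual file:

```r
library(vroom)
library(data.table)

# Create a tiny stand-in for CID-Title.gz: two tab-separated columns, no header.
# (The rows here are made up purely for illustration.)
tmp <- tempfile(fileext = ".gz")
writeLines(c("1\taspirin", "2\tcaffeine"), gzfile(tmp))

# vroom detects the gz compression from the file itself; just pass the filename.
# col_types = "ic" declares an integer column and a character column.
dat <- vroom(tmp,
             delim = "\t",                      # assumption: tab-delimited
             col_names = c("CID", "Title"),     # assumption: no header row
             col_types = "ic")

# Convert the tibble to a data.table by reference; note this materializes
# vroom's lazy Altrep columns, so some of the deferred cost is paid here.
setDT(dat)
dat
```

For the real file you would replace `tmp` with `"CID-Title.gz"` and adjust `delim`/`col_names` to match its actual layout.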

From the vroom benchmarks vignette:

bar plot of vroom benchmarks

Upvotes: 2
