fwzzjd

Reputation: 13

Are there faster ways in data.table fread() to read large gz file into memory?

I am trying to read back a large gz file (about 1.5 GB, with two columns: the 1st column is numeric and the 2nd is character) named CID-Title.gz from my computer's disk (originally, I downloaded this file from here).

I used data.table::fread() as below:

dat <- data.table::fread(file = "CID-Title.gz", showProgress = TRUE, nThread = 16)

How can I make fread() faster when reading this gz file? Or are there other tricks to read large gz files more quickly?

Upvotes: 1

Views: 1660

Answers (1)

Ben Bolker

Reputation: 226182

You can try the vroom package (web page here, CRAN page here). It reads the data into a tidyverse tibble; I don't know whether there is a downstream performance penalty for converting the tibble to a data.table object.

vroom doesn’t stop to actually read all of your data, it simply indexes where each record is located so it can be read later. The vectors returned use the Altrep framework to lazily load the data on-demand when it is accessed, so you only pay for what you use. This lazy access is done automatically, so no changes to your R data-manipulation code are needed.

vroom also uses multiple threads for indexing, materializing non-character columns, and when writing to further improve performance.

vroom supports reading zip, gz, bz2 and xz compressed files automatically, just pass the filename of the compressed file to vroom.
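Putting this together, here is a minimal sketch. Since the real CID-Title.gz is not at hand, this writes a tiny stand-in gz file first; the column names (`CID`, `Title`), the tab delimiter, and the absence of a header row are assumptions based on the file description (numeric ID plus character title), not confirmed facts about the actual file:

```r
library(vroom)
library(data.table)

# Create a tiny stand-in for CID-Title.gz: two tab-separated columns, no header.
# (The rows here are made up purely for illustration.)
tmp <- tempfile(fileext = ".gz")
writeLines(c("1\taspirin", "2\tcaffeine"), gzfile(tmp))

# vroom detects the gz compression from the file itself; just pass the filename.
# col_types = "ic" declares an integer column and a character column.
dat <- vroom(tmp,
             delim = "\t",                      # assumption: tab-delimited
             col_names = c("CID", "Title"),     # assumption: no header row
             col_types = "ic")

# Convert the tibble to a data.table by reference; note this materializes
# vroom's lazy Altrep columns, so some of the deferred cost is paid here.
setDT(dat)
dat
```

For the real file you would replace `tmp` with `"CID-Title.gz"` and adjust `delim`/`col_names` to match its actual layout.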

From the vroom benchmarks vignette:

bar plot of vroom benchmarks

Upvotes: 2
