Nielsou Akbrg

Reputation: 221

R : Fast access to a given line of a big CSV file

My data sources are 2GB files composed of 3 columns: key1, key2, result. Each file contains more than 10 million lines. I am trying to access the i-th line of the j-th file as fast as possible from R, on a Windows system. Fread-ing the whole file is not an option, as it takes more than 2 minutes, which is not acceptable in this use case.

I have tried several methods, but all have failed so far:

Methodology 1: readLines, read.table or fread

readLines(file("myFile.csv", "r"), n = M + 1)[M + 1]  # readLines has no skip argument

read.table("myFile.csv", skip=M, nrows=1)

fread("myFile.csv", skip=M, nrows=1)

The problem with these methodologies is that when M tends toward the end of the file, the access time gets very long, since they all have to scan the file from the beginning to find line M. I would be very interested if a similarly simple solution could be made fast.

Methodology 2: fst package

As suggested in the comments / answers section, the fst package works fine thanks to its from / to arguments, yet it requires duplicating the CSVs in the fst format, which is painful to maintain across thousands of CSV files. A "csv-based" solution would be even better.

Methodology 3: SQLite DB

library(RSQLite)
library(data.table)
db <- dbConnect(SQLite(), "NEW_DB")
dbWriteTable(db, "chocs", fread("myFile.csv"), append = TRUE, row.names = FALSE)
dbGetQuery(db, "SELECT * FROM chocs WHERE V1 = 1 AND V2 = 1")

The problem with this method is that it creates an SQL database even though the CSV is already present, and the queries are not as fast as with the fst approach in Methodology 2.
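For what it's worth, query speed in this setup depends heavily on indexing; a minimal sketch, assuming the two key columns are imported under fread's default names V1 and V2 (toy in-memory data stands in for the imported CSV):

```r
library(RSQLite)

# Toy stand-in for the imported CSV table
db <- dbConnect(SQLite(), ":memory:")
dbWriteTable(db, "chocs",
             data.frame(V1 = c(1, 1, 2), V2 = c(1, 2, 1), V3 = c(10, 20, 30)))

# An index on the key columns turns each lookup into a B-tree search
# instead of a full table scan (a one-time cost, paid after the import):
dbExecute(db, "CREATE INDEX idx_chocs ON chocs (V1, V2)")
res <- dbGetQuery(db, "SELECT * FROM chocs WHERE V1 = 1 AND V2 = 1")
res
dbDisconnect(db)
```

Without the index, each SELECT scans the whole 10M-row table, which would explain the gap versus fst.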

Is there any fast methodology ?

Upvotes: 1

Views: 370

Answers (2)

bendae

Reputation: 789

How about fread from the data.table package? Once the file is read in, the row selection could look something like this:

dt[.(1,1), on = .(V1, V2)]
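A slightly fuller sketch of this approach, assuming the whole file is fread once per session and the key columns come in as V1 and V2 (toy data stands in for the 2GB CSV):

```r
library(data.table)

# Toy stand-in for: dt <- fread("myFile.csv")
dt <- data.table(V1 = c(1, 1, 2), V2 = c(1, 2, 1), result = c(10, 20, 30))

setkey(dt, V1, V2)    # sort once; subsequent joins use binary search
row <- dt[.(1, 1)]    # same lookup as dt[.(1, 1), on = .(V1, V2)], no full scan
row
```

The one-time fread cost (~2 min here) is paid per session, but every lookup after setkey is a binary search rather than a scan.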

Upvotes: 0

De Novo

Reputation: 7630

fst is nice, as I see @TimSalabim mentioned in the comments, but you have to read the CSV in first, then write it out as an fst file. If you need to read arbitrary single lines multiple times in different sessions, then I would say it's reasonable to spend those 2 minutes reading it in once as a data table with fread and then writing it out as an fst file.

Here is the example with iris:

library(fst)
library(data.table)
DT <- data.table(iris)
write_fst(DT, "iris.fst")
x <- read_fst("iris.fst", from = 111, to = 111)
x
#  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
# 1          6.5         3.2          5.1           2 virginica

# exit your session, come back tomorrow
library(fst)
x <- read_fst("iris.fst", from = 99, to = 99)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
# 1          5.1         2.5            3         1.1 versicolor

For you, of course, you'd start with

DT <- fread("myFile.csv")
write_fst(DT, "myFile.fst")
x <- read_fst("myFile.fst", from = 1e7, to = 1e7)

I'm on a dinky machine right now with limited memory, so I'm not going to create a 2GB object and save it to test, but the iris example works, and once you've paid the startup cost of creating the file, your line read should be substantially faster. Presumably you're on a decent machine with OpenMP support and an SSD.
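Since you mention thousands of CSVs, the one-time conversion could be scripted along these lines (paths are illustrative; a temp directory with one toy CSV stands in for your data directory here):

```r
library(data.table)
library(fst)

# Toy stand-in for a directory of CSVs
dir <- tempdir()
fwrite(data.table(V1 = 1:3, V2 = 3:1, result = c(10, 20, 30)),
       file.path(dir, "myFile.csv"))

# Convert every CSV once, keeping the base name with an .fst extension
for (f in list.files(dir, pattern = "\\.csv$", full.names = TRUE)) {
  write_fst(fread(f), sub("\\.csv$", ".fst", f))
}

x <- read_fst(file.path(dir, "myFile.fst"), from = 2, to = 2)
x
```

After that, every random row access is a cheap read_fst(from, to) call, with no per-session fread cost.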

Upvotes: 2
