Reputation: 221
My data sources are 2 GB files composed of 3 columns: key1, key2, result. Each of these files contains more than 10 million lines. I am trying to access the i-th line of the j-th file as fast as possible from R, on a Windows system. fread-ing a whole file is not an option as it takes more than 2 minutes, which is not acceptable in this use-case.
I have tried several methods, but all have failed so far:
Methodology 1: readLines, read.table or fread
readLines(file("myFile.csv", "r"), n = M + 1)[M + 1]  # readLines() has no skip argument, so read M + 1 lines and keep the last
read.table("myFile.csv", skip=M, nrows=1)
fread("myFile.csv", skip=M, nrows=1)
The problem with these methods is that as M approaches the end of the file, the access time becomes very long, since all preceding lines still have to be scanned. I would be very interested if a similar, CSV-only solution could be made fast.
Methodology 2: fst package
As suggested in the comments / answers section, the fst package works fine thanks to its from / to arguments, yet it requires duplicating the CSVs in FST format, which is painful to maintain when reading thousands of CSV files. A "csv-based" solution would be even better.
Methodology 3: SQLite DB
library(RSQLite)
library(data.table)
db <- dbConnect(SQLite(), "NEW_DB")                                               # creates / opens the database file
dbWriteTable(db, "chocs", fread("myFile.csv"), append = TRUE, row.names = FALSE)  # one-time load of the CSV into a table
dbGetQuery(db, "SELECT * FROM chocs WHERE V1 = 1 AND V2 = 1")                     # lookup by the two key columns
The problem with this method is that it creates a SQLite database even though the CSV is already present, and the queries are not as fast as with the fst approach of methodology 2.
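(For what it's worth, adding an index on the two key columns should speed up the WHERE lookup considerably, though it still means maintaining a database alongside the CSV; a sketch using the table created above:)
library(RSQLite)
db <- dbConnect(SQLite(), "NEW_DB")
# an index on the key columns lets SQLite avoid a full table scan
dbExecute(db, "CREATE INDEX IF NOT EXISTS idx_chocs_keys ON chocs (V1, V2)")
dbGetQuery(db, "SELECT * FROM chocs WHERE V1 = 1 AND V2 = 1")
dbDisconnect(db)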
Is there any fast methodology?
Upvotes: 1
Views: 370
Reputation: 789
How about fread from the data.table package? The row selection could look something like this:
dt[.(1,1), on = .(V1, V2)]
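A runnable sketch of how that fits together (the column names V1 / V2 and the upfront fread are assumptions based on the question):
library(data.table)
dt <- fread("myFile.csv")      # one-time cost: the ~2 min read mentioned in the question
setkey(dt, V1, V2)             # optional key so repeated lookups use binary search
dt[.(1, 1), on = .(V1, V2)]    # rows where V1 == 1 and V2 == 1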
Upvotes: 0
Reputation: 7630
fst is nice, as I see @TimSalabim mentioned in the comments, but you have to read the data in first, then write it out as a fst file. If you need to read arbitrary single lines multiple times in different sessions, then I would say it's reasonable to spend that 2 minutes reading it in one time as a data.table with fread and then writing it out as a fst file.
An example with iris:
library(fst)
library(data.table)
DT <- data.table(iris)
write_fst(DT, "iris.fst")
x <- read_fst("iris.fst", from = 111, to = 111)
x
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 6.5 3.2 5.1 2 virginica
# exit your session, come back tomorrow
library(fst)
x <- read_fst("iris.fst", from = 99, to = 99)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 2.5 3 1.1 versicolor
For you, of course, you'd start with
DT <- fread("myFile.csv")
write_fst(DT, "myFile.fst")
x <- read_fst("myFile.fst", from = 1e7, to = 1e7)
I'm on a dinky machine right now with limited memory, so I'm not going to create a 2 GB object and save it to test, but the iris example works, and once you've paid the startup cost of creating the file, your single-line reads should be substantially faster. Presumably you're on a decent machine with OpenMP support and an SSD.
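If you want to check the OpenMP side of that, fst reports (and lets you set) how many threads it uses:
library(fst)
threads_fst()     # number of threads fst will use for reading / writing
threads_fst(4)    # optionally set it explicitly, e.g. to 4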
Upvotes: 2