Read specific, non-consecutive rows using data.table::fread (equivalent to the “Select” argument, but for rows)?

The fread function (data.table) enables users to define specific columns of dataframe to read in using the 'select' argument (e.g. fread(input, select=c(1,5,10)). I would like the same ability, but for rows (e.g. fread(input,selectrows=c(1,4,47)). I could do this after reading in the files, but that takes a very long time and I hope to optimize the process by only reading in the rows I need.

I am aware of a number of options for selecting rows programmatically based on 'within-file' criteria:

Read csv file with selected rows using data.table's fread

Quickest way to read a subset of rows of a CSV

...but I want to be able to use a vector defined based on criteria outside of the given file to be read in (as in this question, but specifically using fread).

Upvotes: 1

Views: 1457

Answers (1)

r2evans
r2evans

Reputation: 160447

One method (although a little brute-force) is to use sed to cut the lines.

Recall that fread takes file= as well as cmd=, as in

library(data.table)
fwrite(iris, "iris.csv")
fread(cmd = "head -n 3 iris.csv")
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1:          5.1         3.5          1.4         0.2  setosa
# 2:          4.9         3.0          1.4         0.2  setosa

(Two rows since head doesn't know/care about the header row.)

Try this:

want_rows <- c(1, 3, 147:149)
# due to the header row, add the header and 1 to each of want
paste0(c(1, 1+want_rows), "p")
# [1] "1p"   "2p"   "4p"   "148p" "149p" "150p"
writeLines(paste0(c(1, 1+want_rows), "p"), "commands.sed")

fread(cmd = "sed -n -f commands.sed iris.csv")
#    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
# 1:          5.1         3.5          1.4         0.2    setosa
# 2:          4.7         3.2          1.3         0.2    setosa
# 3:          6.3         2.5          5.0         1.9 virginica
# 4:          6.5         3.0          5.2         2.0 virginica
# 5:          6.2         3.4          5.4         2.3 virginica
iris[want_rows,]
#     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
# 1            5.1         3.5          1.4         0.2    setosa
# 3            4.7         3.2          1.3         0.2    setosa
# 147          6.3         2.5          5.0         1.9 virginica
# 148          6.5         3.0          5.2         2.0 virginica
# 149          6.2         3.4          5.4         2.3 virginica

If you have significant "ranges", then you could optimize this a little for sed, to have an effective command line of sed -ne '1p;2p;4p;148,150p' for the same effect.

There is another method ala "every so many rows" listed here: https://www.thegeekstuff.com/2009/09/unix-sed-tutorial-printing-file-lines-using-address-and-patterns/. I don't know how tightly you can control this (every nth row starting from some arbitrary number, for instance). I don't know that this is your intent or need, though, it sounds like will have arbitrary line numbers.

Upvotes: 2

Related Questions