Reputation: 65
The fread function (data.table) enables users to define specific columns of dataframe to read in using the 'select' argument (e.g. fread(input, select=c(1,5,10)). I would like the same ability, but for rows (e.g. fread(input,selectrows=c(1,4,47)). I could do this after reading in the files, but that takes a very long time and I hope to optimize the process by only reading in the rows I need.
I am aware of a number of options for selecting rows programmatically based on 'within-file' criteria:
Read csv file with selected rows using data.table's fread
Quickest way to read a subset of rows of a CSV
...but I want to be able to use a vector defined based on criteria outside of the given file to be read in (as in this question, but specifically using fread).
Upvotes: 1
Views: 1457
Reputation: 160447
One method (although a little brute-force) is to use sed
to cut the lines.
Recall that fread
takes file=
as well as cmd=
, as in
library(data.table)
fwrite(iris, "iris.csv")
fread(cmd = "head -n 3 iris.csv")
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1: 5.1 3.5 1.4 0.2 setosa
# 2: 4.9 3.0 1.4 0.2 setosa
(Two rows since head
doesn't know/care about the header row.)
Try this:
want_rows <- c(1, 3, 147:149)
# due to the header row, add the header and 1 to each of want
paste0(c(1, 1+want_rows), "p")
# [1] "1p" "2p" "4p" "148p" "149p" "150p"
writeLines(paste0(c(1, 1+want_rows), "p"), "commands.sed")
fread(cmd = "sed -n -f commands.sed iris.csv")
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1: 5.1 3.5 1.4 0.2 setosa
# 2: 4.7 3.2 1.3 0.2 setosa
# 3: 6.3 2.5 5.0 1.9 virginica
# 4: 6.5 3.0 5.2 2.0 virginica
# 5: 6.2 3.4 5.4 2.3 virginica
iris[want_rows,]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 147 6.3 2.5 5.0 1.9 virginica
# 148 6.5 3.0 5.2 2.0 virginica
# 149 6.2 3.4 5.4 2.3 virginica
If you have significant "ranges", then you could optimize this a little for sed
, to have an effective command line of sed -ne '1p;2p;4p;148,150p'
for the same effect.
There is another method ala "every so many rows" listed here: https://www.thegeekstuff.com/2009/09/unix-sed-tutorial-printing-file-lines-using-address-and-patterns/. I don't know how tightly you can control this (every nth row starting from some arbitrary number, for instance). I don't know that this is your intent or need, though, it sounds like will have arbitrary line numbers.
Upvotes: 2