Reputation: 588
I have simulated data stored locally, which I read into a list with lapply and fread, and then work from the list thereafter. All files are equal in dimensions and format. I am wondering how I can improve reading efficiency by using the cmd argument of fread.
For example, if I am only interested in certain variables, the select argument comes in handy. The nrows argument helps when data from a specific year only is needed: I first find the last row whose year is less than or equal to the year of interest, read only up to that row, and filter to the exact year afterwards.
Having come across this article, I have realised that reading the files can be optimised further. However, I don't understand how to use the sprintf function, since I want to preprocess on an integer value (the year, in my case) rather than a string. Maybe using lapply creates a problem too(?).
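To show what I mean, sprintf does take an integer directly through the %d placeholder, so I can build a command string along the lines of the article (grep here is just a placeholder for whatever tool is appropriate); what I am unsure about is how to combine this properly with lapply and select across all files:
myYear <- 3L
# %d inserts the integer year (%s would also work, coercing to character);
# the intention is to keep only lines whose first field is the year 3
myCmd <- sprintf('grep "^%d," %s', myYear, dataFiles[1])
# myCmd is now: grep "^3," df1.csv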
If someone could please help with preprocessing multiple files (subsetting by an integer value and selecting the desired variables) using fread and the cmd argument, it would be greatly appreciated.
Here is a MWE of my approach:
library(data.table)
library(dplyr)
# last row whose year is still <= 3, so fread can stop reading there
keepRows <- min(which(df$year > 3)) - 1
all.df <- lapply(seq_along(dataFiles), function(i) {
  dt <- fread(dataFiles[i], select = c("year", "a", "c", "e"),
              nrows = keepRows)
  filter(dt, year == 3)
})
Data:
df <- data.frame(year = rep(1:6, each = 3),
                 a = sample(20:25, 18, replace = TRUE),
                 b = rnorm(18, 1),
                 c = rbinom(18, 30, 0.25),
                 d = sample(letters, 18),
                 e = paste0(sample(letters, 18), sample(1:3, 18, replace = TRUE)))
dataFiles <- paste0("df", 1:5, ".csv")
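For the purposes of this MWE, each of the five files is assumed to be just a copy of df, written out like this:
# write identical copies of df so the example above can actually be run
invisible(lapply(dataFiles, function(f) fwrite(df, f)))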
Upvotes: 2
Views: 349
Reputation: 27762
Here is my go at things, using the findstr command in Windows 10.
What it does: it builds a command that looks for all lines that start with the string "3,", "4,", "5,", "6,", "7,", "8," or "9,", and passes these results to fread.
It needs the full path to your csv files; relative paths won't work. You will also have to use gsub to get Windows-readable filenames (backslashes instead of forward slashes).
# what files to read?
files.to.read <- list.files( path = "C:/Users/your/Documents/here/temp",
                             pattern = "^df[1-5]\\.csv$",
                             full.names = TRUE )
# read files, keeping only lines that start with the regex "^[3-9],"
pattern <- "^[3-9],"
data.table::rbindlist(
  lapply( files.to.read, function(x) {
    # findstr needs backslashes, so convert the forward slashes in the path
    data.table::fread( cmd = paste0( 'findstr /R "', pattern, '" ',
                                     gsub( "\\/", "\\\\", x ) ),
                       sep = ",",
                       header = FALSE )
  }),
  use.names = TRUE, idcol = "from_df" )
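Because findstr also filters out the header line, the columns come back as V1 to V6. A small follow-up, assuming the rbindlist() result above is assigned to DT and the on-disk column order matches the question's data, restores the names and keeps only the wanted columns:
# assumes DT holds the rbindlist() output above, and that the on-disk column
# order is year, a, b, c, d, e (as in the question's data.frame)
myCols <- c("year", "a", "b", "c", "d", "e")
data.table::setnames(DT, old = paste0("V", seq_along(myCols)), new = myCols)
DT <- DT[, c("from_df", "year", "a", "c", "e"), with = FALSE]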
Upvotes: 1
Reputation: 56219
We can use awk within cmd (not tested):
library(data.table)
myCols <- c("year", "a", "b", "c", "d", "e")
mySelect <- c("year", "a", "c", "e")
myYear <- 3
ixCol <- match(mySelect, myCols)
all.df <- lapply(list.files("path/to/files/", pattern = "\\.csv$", full.names = TRUE),
                 function(x) {
                   # awk keeps only the rows whose first field equals myYear
                   myCmd <- paste0("awk -F, '$1 == ", myYear, "' ", x)
                   fread(cmd = myCmd, select = ixCol, col.names = myCols[ixCol])
                 })
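If a single table is preferred over the list, the per-file results can be stacked afterwards, e.g.:
# optional: stack the per-file results, keeping track of the source file
all.dt <- rbindlist(all.df, use.names = TRUE, idcol = "file")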
Upvotes: 2