Reputation: 588

How to pre-process .csv files on integer values with cmd and fread from data.table

I have simulated data stored locally which I read into a list with lapply and fread and then work from the list thereafter. All files are equal in dimensions and format. I am wondering how I can improve the efficiency of reading time by using the cmd argument of fread.

For example, if I am interested in certain variables the select argument comes handy. The nrows argument is helpful when data from a specific year only is needed by first identifying the rows that include the years less than and equal to the year of interest and then filtering afterwards.

Having come across this article I have realised that the reading of files can be further optimised. However, I don't understand how to use the sprintf function since I want to preprocess on an integer value (the year in my case) rather than a string. Maybe using lapply creates a problem too (?).

If someone could please help on preprocessing multiple files (subsetting by an integer value and selecting desired variables) with fread and the cmd argument it would be greatly appreciated.

Here is a MWE of my approach:

keepRows <- min(which(df$year > 3)) - 1
all.df <- lapply(seq_along(dataFiles), function(x){
  x <- fread(dataFiles[x], select = c("year", "a", "c", "e"),
             nrows = keepRows) #,
  x <- filter(x, year == 3)
})

Data:

df <- data.frame(year = rep(1:6, each = 3),
                 a = sample(20:25, 18, replace = TRUE),
                 b = rnorm(18, 1),
                 c = rbinom(18, 30, 0.25),
                 d = sample(letters, 18),
                 e = paste0(sample(letters, 18), sample(1:3, 18, replace = TRUE)))

dataFiles <- paste0("df", 1:5, ".csv")

Upvotes: 2

Answers (2)

Wimpel

Reputation: 27762

Here is my go at things using the findstr-command in Window 10

What it does: It builds a command that lookes for all line sthat start with the string 3,, 4,,5,,6,,7,,8, or 9,, and passes these results to fread.

it needs the full path to your csv-files.. relative wo'nt work.. You will also have to use gsub to get windows-readable filenames

#what files to read?
files.to.read <- list.files( path = "C:/Users/your/Documents/here/temp", pattern = "^df[1-5]\\.csv$", full.names = TRUE )
#read files, only line stat start with regex "^[3-9],"
pattern = "^[3-9],"
data.table::rbindlist(
  lapply( files.to.read, function(x) {
        data.table::fread( cmd =  paste0( 'findstr /R "', pattern, '" ',  gsub( "\\/", "\\\\", x ) ), 
                           sep = ",", 
                           header = FALSE )
  }),
  use.names = TRUE, idcol = "from_df" )

Upvotes: 1

zx8754

Reputation: 56219

We can use awk within cmd (not tested):

library(data.table)

myCols <- c("year", "a", "b", "c", "d", "e")
mySelect <- c("year", "a", "c", "e")
myYear <- 3
ixCol <- match(mySelect, myCols)

all.df <- lapply(list.files("path/to/files/", pattern = "*.csv"), function(x){
  myCmd <- paste0("awk -F, '$1 == ", myYear, "' ", x)
  fread(cmd = myCmd, select = ixCol, col.names = myCols[ ixCol ])
})

Upvotes: 2

How to pre-process .csv files on integer values with cmd and fread from data.table

Answers (2)

Related Questions