Batch reading only certain variables of many files

Question

I have one different CSV file for every month since 2010. They all have the same basic information, but recent files have more variables than the old ones. They are also arranged in different orders.

I want to batch read all the files in my directory as data.tables, select a list of variables from each read file and then compile them into a longitudinal data.table unsing rbind.

I thought of using something along the lines of:

all_files = list.files()
dbs <- lapply(all_files, data.table)

Then select a list of variables from each data.table:

object[ , c("ID", "Date", "Paid", "Pending")]
# These 4 variables have the exact same names throughout all files.

And finally, use rbind to bind all the data.tables together.

Thanks in advance!

Gautam · Accepted Answer

If the columns are named identically:

library(data.table) 
fnames <- dir(path = choose.dir(), pattern = '.*csv', full.names = T, recursive = F)
cols <- c('ID', 'Date', 'Paid', 'Pending')
data_list <- lapply(fnames, function(z){
  ret <- fread(z, select = cols)
  ret$filename <- z
  return(ret)
})

I stored filename as a column name because you may want to trace back where the data came from (for any discrepancies etc.).

To get it into a single data.table object:

dat <- rbindlist(dat_list, use.names = T, fill = T)

Batch reading only certain variables of many files

Answers (1)

Related Questions