Ed9012

Reputation: 103

R - apply a data.table function to each file in a folder and export the results

I'm looking for some help with the following. I have many files in a folder; each is a txt file containing 16 columns, like this:

head(a1)
v1 v2 ... v16
2.0742 1.1520 ... 5.6852
-1.4071 1.1848 ... 2.7629

which I want to transform into a single long column, using the data.table package:

library(data.table)
setDT(a1)
a1 <- melt(a1)[, .(value)]
v1
2.0742
-1.4071
...
2.7629

I would like to automate this with a for loop that reads each file in the folder, applies melt, and writes the transformed file to another folder. Any idea where to start?

Upvotes: 0

Views: 165

Answers (1)

Uwe

Reputation: 42544

According to the OP's comments, there are 2 directories with 50 files each, and every file has 7000 rows and 16 columns. Assuming all columns are of type double, which requires 8 bytes per value, the total data volume is roughly 100 MB, which can be stored and processed in memory.
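The estimate can be checked with a few lines of arithmetic. This is only a sketch of the raw payload size; the variable names are made up for illustration, and the real footprint in R will be somewhat larger because of the extra id columns and object overhead.

```r
# Back-of-the-envelope size of the raw numeric data (illustrative names).
n_dirs           <- 2     # directories, per the OP's comments
n_files          <- 50    # files per directory
n_rows           <- 7000  # rows per file
n_cols           <- 16    # columns per file
bytes_per_double <- 8     # storage size of one double

total_bytes <- n_dirs * n_files * n_rows * n_cols * bytes_per_double
total_mb    <- total_bytes / 1e6
print(total_mb)
```

So the raw values come to a bit under 100 MB, which fits comfortably in memory on most machines.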

So, my suggestion is to read all data in one go and combine and process it in one large data.table in memory.

Here is what I would do using my preferred tools:

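library(data.table)
library(magrittr)
# full paths of all files in the input directory
file_names <- list.files(test_dir, full.names = TRUE)
# read every file, name each list element after its file,
# then stack everything into one data.table with a file_name id column
all_wide <- lapply(file_names, fread) %>%
  set_names(basename(file_names)) %>%
  rbindlist(idcol = "file_name")
# reshape from wide (V1 ... V16) to long (variable, value)
all_long <- melt(all_wide, id.vars = "file_name")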
all_long
           file_name variable     value
              <char>   <fctr>     <num>
      1: File001.txt       V1  101.0000
      2: File001.txt       V1  101.0000
      3: File001.txt       V1  101.0000
      4: File001.txt       V1  101.0000
      5: File001.txt       V1  101.0001
     ---                               
5599996: File050.txt      V16 5016.0700
5599997: File050.txt      V16 5016.0700
5599998: File050.txt      V16 5016.0700
5599999: File050.txt      V16 5016.0700
5600000: File050.txt      V16 5016.0700

This processes all files in directory test_dir.

Memory consumption can be displayed by

tables()
       NAME      NROW NCOL  MB                         COLS KEY
1: all_long 5,600,000    3 107     file_name,variable,value    
2: all_wide   350,000   17  45 file_name,V1,V2,V3,V4,V5,...    
3:        d     7,000   16   1        V1,V2,V3,V4,V5,V6,...    
Total: 153MB

The source of each row can be identified by file_name.
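If one transformed file per input file is still needed, as the question asks, the file_name column can be used to write the long data back out. The following is a minimal sketch, not part of the approach above: out_dir is an assumed output folder, and the small all_long table is a hypothetical stand-in so that the snippet runs on its own.

```r
library(data.table)

# Hypothetical stand-in for all_long; with real data, skip this definition.
all_long <- data.table(
  file_name = rep(c("File001.txt", "File002.txt"), each = 4),
  variable  = factor(rep(rep(c("V1", "V2"), each = 2), times = 2)),
  value     = c(1.1, 1.2, 2.1, 2.2, 3.1, 3.2, 4.1, 4.2)
)

# Assumed output folder; replace with the real target directory.
out_dir <- file.path(tempdir(), "files_out")
dir.create(out_dir, showWarnings = FALSE)

# Write one file per source file, keeping only the single long value column.
for (fn in unique(all_long$file_name)) {
  fwrite(all_long[file_name == fn, .(value)], file.path(out_dir, fn))
}
```

Each output file keeps the name of its source file and contains a single column named value. Filtering inside a loop is easy to read; for a much larger number of files, splitting the table by file_name first may be faster.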

Data for testing

Warning: The code below will create a subdirectory containing nfil files in the R session's temporary directory (tempdir()).

library(data.table)
nfil <- 50 # number of files
nrow <- 7000 # number of rows per file
ncol <- 16 # number of columns
test_dir <- file.path(tempdir(), paste0("files_in_", as.integer(Sys.time())))
print(test_dir)
dir.create(test_dir)
for (ifil in seq(nfil)) {
  d <- data.table()
  for (icol in seq(ncol)) set(d, , paste0("V", icol), ifil * 100 + icol + seq(nrow)/10^(ceiling(log10(nrow))+1))
  fwrite(d, file.path(test_dir, sprintf("File%03i.txt", ifil)))
  print(d)
}
dir(test_dir)

Upvotes: 1
