Dylan Russell

Reputation: 1098

R Combine very large list of data tables into one data.table

I have a very large list of 13 data.tables (~11.1 GB total). I have 16 GB of RAM. With the list loaded into memory, I have 5 GB of RAM left.

I need to combine these into a single data.table. I would prefer to use data.table::rbindlist if possible due to its fill = TRUE parameter (some of my data.tables have columns the others do not have - I need those filled with NA).
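
To illustrate what I mean, here's a minimal toy example of fill = TRUE padding the missing columns with NA (dummy data, not my real tables):

library(data.table)
# b lacks column y and a lacks column z; fill = TRUE pads both with NA
a <- data.table(x = 1:2, y = c("a", "b"))
b <- data.table(x = 3, z = TRUE)
rbindlist(list(a, b), use.names = TRUE, fill = TRUE)
#    x    y    z
# 1: 1    a   NA
# 2: 2    b   NA
# 3: 3 <NA> TRUE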

The problem is that this requires more than 5 GB of RAM to complete, so I am unable to combine the list. It would seem that since I already have the data loaded into memory, a combined data.table wouldn't be any bigger. I just need to figure out whether there's a way to complete the operation without holding both the list and a full combined copy in memory (~22 GB of RAM) to perform the rbindlist.

In order to make this list in the first place, I am running an lapply that looks like:

df <- lapply(fs::dir_ls(dir), function(file) {
  clean_data(file)
})

I am taking a list of .csv files and converting them to clean data.tables via lapply, which is how I end up with a list.

purrr::map_dfr does not seem to work, and neither does simply wrapping the lapply in rbindlist.
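
For reference, the wrapped attempt looks roughly like this (a sketch; clean_data is my own cleaning function from above):

# attempted: bind while reading, instead of holding the full list first
df <- data.table::rbindlist(
  lapply(fs::dir_ls(dir), clean_data),
  use.names = TRUE, fill = TRUE
)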

Upvotes: 2

Views: 1464

Answers (2)

dww

Reputation: 31452

Here's an R data.table approach. It's similar in concept to @r2evans' answer, but uses only R: we concatenate all the files into a single csv with all the columns filled in.

First, read the column names of all the files and create a vector of the unique names:

library(data.table)
fnames = list.files(pattern = 'dt.+csv')
fcols = lapply(fnames, fread, nrows = 0)       # headers only, no data rows
fcols = unique(unlist(lapply(fcols, names)))   # flatten; safe even if files have different column counts

Now append each file's data to a single csv, with missing columns filled with NA. Note that the remove(fdt); gc() lines may not be needed, as fdt gets reassigned and R should handle the memory management of this OK for you. I added them just to make sure that any memory no longer in use is freed regularly.

for (f in fnames) {
  fdt = fread(f)
  newcols = setdiff(fcols, names(fdt))          # columns this file is missing
  if (length(newcols)) fdt[, (newcols) := NA]   # fill them with NA
  fwrite(fdt[, .SD, .SDcols = fcols], 'all.csv', append = TRUE)
  remove(fdt)
  gc()
}
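
One caveat: fwrite's append = TRUE will happily append to an 'all.csv' left over from an earlier run, so clear it out before starting the loop:

# start fresh so append = TRUE doesn't stack onto an old run
if (file.exists('all.csv')) file.remove('all.csv')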

Then we just read in the single combined file:

fdt = fread('all.csv')
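
With the small dummy files below, you can sanity-check the streamed result against a plain in-memory bind (only feasible for toy data; on the real 11 GB list this would defeat the purpose):

# toy-data check: the streamed csv should match an in-memory rbindlist
in_mem = rbindlist(lapply(fnames, fread), use.names = TRUE, fill = TRUE)
setcolorder(in_mem, fcols)   # all.csv was written in fcols order
all.equal(fdt, in_mem)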

Some dummy reproducible data files:

set.seed(1)
# 4 tables, each with 3 of the 5 possible columns a-e and 10 rows of random data
# (note: := cannot add columns to a null data.table, so build the columns directly)
dt1 = setnames(data.table(replicate(3, sample(10))), sample(letters[1:5], 3))
dt2 = setnames(data.table(replicate(3, sample(10))), sample(letters[1:5], 3))
dt3 = setnames(data.table(replicate(3, sample(10))), sample(letters[1:5], 3))
dt4 = setnames(data.table(replicate(3, sample(10))), sample(letters[1:5], 3))
fwrite(dt1, 'dt1.csv')
fwrite(dt2, 'dt2.csv')
fwrite(dt3, 'dt3.csv')
fwrite(dt4, 'dt4.csv')
remove(dt1, dt2, dt3, dt4)

Upvotes: 1

r2evans

Reputation: 160447

There might be an R-only way of doing this, but one efficient way would be to do this with command-line (not R) tools.

Setup for the sake of this answer:

library(data.table)

mt1 <- mtcars[1:3, c(1, 2, 3)]
mt2 <- mtcars[3:4, c(1, 2, 4)]
mt3 <- mtcars[5:10, c(1, 3, 4)]

combined <- rbindlist(list(mt1, mt2, mt3), use.names = TRUE, fill = TRUE)
combined
#      mpg cyl  disp  hp
#  1: 21.0   6 160.0  NA
#  2: 21.0   6 160.0  NA
#  3: 22.8   4 108.0  NA
#  4: 22.8   4    NA  93
#  5: 21.4   6    NA 110
#  6: 18.7  NA 360.0 175
#  7: 18.1  NA 225.0 105
#  8: 14.3  NA 360.0 245
#  9: 24.4  NA 146.7  62
# 10: 22.8  NA 140.8  95
# 11: 19.2  NA 167.6 123

write.table(mt1, "mt1.tsv", row.names = FALSE)
write.table(mt2, "mt2.tsv", row.names = FALSE)
write.table(mt3, "mt3.tsv", row.names = FALSE)

Now that we know what the data looks like, let's get the filenames programmatically:

filenames <- list.files(".", pattern = "^mt.*\\.tsv", full.names = TRUE)
filenames
# [1] "./mt1.tsv" "./mt2.tsv" "./mt3.tsv"

From here, let's grab the first row from each file (quickly/efficiently, since it's only one row each) and rbindlist them so that we know what the resulting table should look like. Granted, we don't need to keep any of the actual values, just the columns.

row1s <- rbindlist(lapply(filenames, function(fn) fread(fn, nrows = 1)),
                   use.names = TRUE, fill = TRUE)[0,]
row1s
# Empty data.table (0 rows and 4 cols): mpg,cyl,disp,hp

For demonstration here, note that merging this 0-row table with one of the originals renders a consistent schema. (No need to do this with real data unless you want to verify one or two.)

row1s[mt1, on = intersect(names(row1s), names(mt1))]
#     mpg cyl disp hp
# 1: 21.0   6  160 NA
# 2: 21.0   6  160 NA
# 3: 22.8   4  108 NA
row1s[mt2, on = intersect(names(row1s), names(mt2))]
#     mpg cyl disp  hp
# 1: 22.8   4   NA  93
# 2: 21.4   6   NA 110

The goal will be to do this for all files programmatically:

# iterate through each file: read, left-join, write
for (fn in filenames) {
  dat <- fread(fn)
  dat <- row1s[dat, on = intersect(names(row1s), names(dat))]
  fwrite(dat, file.path(dirname(fn), paste0("augm_", basename(fn))), sep = "\t")
}

newfilenames <- list.files(".", pattern = "^augm_mt.*\\.tsv$", full.names = TRUE)
newfilenames
# [1] "./augm_mt1.tsv" "./augm_mt2.tsv" "./augm_mt3.tsv"

To verify the new files look consistent, look for the double-\t, which indicates an empty field (i.e. NA on import):

# double-\t indicates an empty field
lapply(newfilenames, readLines, n = 2)
# [[1]]
# [1] "mpg\tcyl\tdisp\thp" "21\t6\t160\t"      
# [[2]]
# [1] "mpg\tcyl\tdisp\thp" "22.8\t4\t\t93"     
# [[3]]
# [1] "mpg\tcyl\tdisp\thp" "18.7\t\t360\t175"  

Now that we have this, let's go to a command prompt (on Windows, git-bash or Windows' bash if you need it). We need bash and one of tail or grep. The goal is that we want the column headers from one of these augm_mt files but none of the others.

If we naively concatenate the files, we'll see the header row repeated in the middle of the data ... and in R, this means that every column will be read as character, which is likely not what you want:

$ cat augm_mt1.tsv augm_mt2.tsv
mpg     cyl     disp    hp
21      6       160
21      6       160
22.8    4       108
mpg     cyl     disp    hp
22.8    4               93
21.4    6               110

Three options to side-step this, depending on what tools you have and how much you trust the contents of your data. (I suggest number 1, tail, if you have it, as it is the least ambiguous.)

  1. If you have tail, then we can "start at line 2" (skipping line 1) for each file:

    $ cat augm_mt2.tsv
    mpg     cyl     disp    hp
    22.8    4               93
    21.4    6               110
    
    $ tail -n +2 augm_mt2.tsv
    22.8    4               93
    21.4    6               110
    

    If you run this on multiple files, tail tends to prepend the filename to each set of tailed lines (try it); we'll suppress that in favor of contiguous lines by adding the -q option.

  2. If you know that one or more of the column names are never seen in the real contents, then you can do one of:

    $ grep -v mpg augm_mt2.tsv
    22.8    4               93
    21.4    6               110
    
    $ grep -v 'mpg.*cyl.*disp' augm_mt3.tsv
    18.7            360     175
    18.1            225     105
    14.3            360     245
    24.4            146.7   62
    22.8            140.8   95
    19.2            167.6   123
    
  3. More complicated but should be safer than "hand-written regex" as in number 2:

    $ HDR=$(head -n 1 augm_mt2.tsv)
    $ grep -F "$HDR" -v augm_mt2.tsv
    22.8    4               93
    21.4    6               110
    

    (The -F means "fixed string", so no regex matching is attempted. This is safest, since something like a period in a column name could pose a potential risk. Remote, but non-zero.)

Whichever way you choose to go, this is how you combine those three files into one mega-file for reading back into R:

$ { head -n 1 augm_mt1.tsv ; tail -q -n +2 augm_*.tsv ; } > alldata_mt.tsv

The head -n 1 merely outputs the header row, no data, making it much easier to use augm_*.tsv in the next command. (Otherwise we would need to find a way to match all-but-the-first file in the glob.)
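
If you'd rather not leave R for this step, a rough equivalent of the head/tail combination is below; note that readLines pulls each file fully into memory, so this sketch is only sensible when each individual file fits comfortably in RAM:

# R sketch of { head -n 1 ; tail -q -n +2 ; }: one header line, then all data lines
header <- readLines(newfilenames[1], n = 1)
body   <- unlist(lapply(newfilenames, function(fn) readLines(fn)[-1]))
writeLines(c(header, body), "alldata_mt.tsv")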

Now we can read this back into R in one command:

fread("alldata_mt.tsv")
#      mpg cyl  disp  hp
#  1: 21.0   6 160.0  NA
#  2: 21.0   6 160.0  NA
#  3: 22.8   4 108.0  NA
#  4: 22.8   4    NA  93
#  5: 21.4   6    NA 110
#  6: 18.7  NA 360.0 175
#  7: 18.1  NA 225.0 105
#  8: 14.3  NA 360.0 245
#  9: 24.4  NA 146.7  62
# 10: 22.8  NA 140.8  95
# 11: 19.2  NA 167.6 123

And for validation with this micro-data:

all.equal(fread("alldata_mt.tsv"), combined)
# [1] TRUE

Alternative that leaves intermediate files with no column headers, so we don't have to dance around things:

for (fn in filenames) {
  dat <- fread(fn)
  dat <- row1s[dat, on = intersect(names(row1s), names(dat))]
  fwrite(dat, file.path(dirname(fn), paste0("augm_", basename(fn))), sep = "\t", col.names = FALSE)
}

Then in bash:

$ cat augm_*tsv > alldata2_mt.tsv

Then in R again,

fread("alldata2_mt.tsv", header = FALSE)
#       V1 V2    V3  V4
#  1: 21.0  6 160.0  NA
#  2: 21.0  6 160.0  NA
#  3: 22.8  4 108.0  NA
#  4: 22.8  4    NA  93
#  5: 21.4  6    NA 110
#  6: 18.7 NA 360.0 175
#  7: 18.1 NA 225.0 105
#  8: 14.3 NA 360.0 245
#  9: 24.4 NA 146.7  62
# 10: 22.8 NA 140.8  95
# 11: 19.2 NA 167.6 123

... and you'll have to know the names to reassign them. This method seems like a little less work (it is), but it does leave the remote possibility that the order of the column names is inadvertently changed. The first method above, which preserves the column names in all of the files, preempts that potential misstep.
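
For instance, since row1s already captured the canonical column order, reassigning the names could look like this (a sketch):

dat <- fread("alldata2_mt.tsv", header = FALSE)
setnames(dat, names(row1s))   # reapply the schema captured earlier in row1s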

Upvotes: 3
