Reputation: 1098
I have a very large list of 13 data.tables (~11.1 GB total). I have 16 GB of RAM, so with the list loaded into memory I have about 5 GB left.
I need to combine these into a single data.table. I would prefer to use data.table::rbindlist if possible, due to its fill = TRUE parameter (some of my data.tables have columns the others do not have, and I need those filled with NA).
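For example, fill = TRUE pads the missing columns like this on toy tables:
library(data.table)
rbindlist(list(data.table(a = 1), data.table(b = 2)), fill = TRUE)
#     a  b
# 1:  1 NA
# 2: NA  2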
The problem is that this requires more than the 5 GB of RAM I have free, so I am unable to combine the list. It would seem that I already have the data loaded into memory, and a combined data.table wouldn't be any bigger; I just need to figure out whether there's a way to complete the operation without having to copy the entire list in memory (occupying 22 GB of RAM) to perform the rbindlist.
In order to make this list in the first place, I am running an lapply that looks like:
df <- lapply(fs::dir_ls(dir), function(file) {
clean_data(file)
})
I am taking a list of .csv files and converting them to clean data.tables via lapply, which is how I end up with a list.
purrr::map_dfr does not seem to work, and neither does just wrapping the lapply in rbindlist.
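That is, something along these lines (a sketch of the failing call):
combined <- rbindlist(lapply(fs::dir_ls(dir), clean_data), use.names = TRUE, fill = TRUE)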
Upvotes: 2
Views: 1464
Reputation: 31452
Here's an R data.table approach, similar in concept to @r2evans' answer but using only R: we can start by concatenating all the files into a single csv with all the columns filled in.
First, read the column names of all the files and create a vector of the unique names:
library(data.table)
fnames = list.files(pattern = 'dt.+csv')
fcols = lapply(fnames, fread, nrows = 0)
fcols = lapply(fcols, names)
# unlist() flattens the per-file name vectors correctly even when the
# files have differing numbers of columns
fcols = unique(unlist(fcols))
Now add each file's data into a single csv, with missing values replaced with NA. Note that the remove(fdt); gc() lines may not be needed, as fdt gets reassigned each iteration and R should handle the memory management of this ok for you. I added them just to make sure that any no-longer-used memory is freed regularly.
unlink('all.csv')  # remove any stale output first, since we append below
for (f in fnames) {
  fdt = fread(f)
  # add any columns this file is missing, filled with NA
  # (the length() guard covers files that already have every column)
  newcols = setdiff(fcols, names(fdt))
  if (length(newcols)) fdt[, (newcols) := NA]
  # write the columns in a consistent order, appending to the combined file
  fwrite(fdt[, .SD, .SDcols = fcols], 'all.csv', append = TRUE)
  remove(fdt)
  gc()
}
Then we just read back the single combined file:
fdt = fread('all.csv')
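Optionally, as a cheap sanity check (this re-reads only the first column of each input file), confirm the combined row count matches the inputs:
stopifnot(nrow(fdt) == sum(sapply(fnames, function(f) nrow(fread(f, select = 1L)))))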
For reference, here is the demo data used for this answer:
set.seed(1)
dt1 = data.table()[, (sample(letters[1:5],3)) := sample(10)]
dt2 = data.table()[, (sample(letters[1:5],3)) := sample(10)]
dt3 = data.table()[, (sample(letters[1:5],3)) := sample(10)]
dt4 = data.table()[, (sample(letters[1:5],3)) := sample(10)]
fwrite(dt1, 'dt1.csv')
fwrite(dt2, 'dt2.csv')
fwrite(dt3, 'dt3.csv')
fwrite(dt4, 'dt4.csv')
remove(dt1, dt2, dt3, dt4)
Upvotes: 1
Reputation: 160447
There might be an R-only way of doing this, but one efficient way would be to do this with command-line (not R) tools.
Setup for the sake of this answer:
library(data.table)
mt1 <- mtcars[1:3,c(1,2,3)]
mt2 <- mtcars[3:4,c(1,2,4)]
mt3 <- mtcars[5:10,c(1,3,4)]
combined <- rbindlist(list(mt1, mt2, mt3), use.names = TRUE, fill = TRUE)
combined
# mpg cyl disp hp
# 1: 21.0 6 160.0 NA
# 2: 21.0 6 160.0 NA
# 3: 22.8 4 108.0 NA
# 4: 22.8 4 NA 93
# 5: 21.4 6 NA 110
# 6: 18.7 NA 360.0 175
# 7: 18.1 NA 225.0 105
# 8: 14.3 NA 360.0 245
# 9: 24.4 NA 146.7 62
# 10: 22.8 NA 140.8 95
# 11: 19.2 NA 167.6 123
# write tab-separated to match the .tsv extension (write.table's default sep is a space)
write.table(mt1, "mt1.tsv", sep = "\t", row.names = FALSE)
write.table(mt2, "mt2.tsv", sep = "\t", row.names = FALSE)
write.table(mt3, "mt3.tsv", sep = "\t", row.names = FALSE)
Now that we know what the data looks like, let's get the filenames programmatically:
filenames <- list.files(".", pattern = "^mt.*\\.tsv", full.names = TRUE)
filenames
# [1] "./mt1.tsv" "./mt2.tsv" "./mt3.tsv"
From here, let's grab the first row from each file (quickly/efficiently, since it's only 1 row each) and rbindlist them so that we know what the resulting table should look like. Granted, we don't need to keep any of the actual values, just the columns.
row1s <- rbindlist(lapply(filenames, function(fn) fread(fn, nrows = 1)),
use.names = TRUE, fill = TRUE)[0,]
row1s
# Empty data.table (0 rows and 4 cols): mpg,cyl,disp,hp
For demonstration here, note that merging this 0-row table with one of the originals renders a consistent schema. (No need to do this with real data unless you want to verify one or two.)
row1s[mt1, on = intersect(names(row1s), names(mt1))]
# mpg cyl disp hp
# 1: 21.0 6 160 NA
# 2: 21.0 6 160 NA
# 3: 22.8 4 108 NA
row1s[mt2, on = intersect(names(row1s), names(mt2))]
# mpg cyl disp hp
# 1: 22.8 4 NA 93
# 2: 21.4 6 NA 110
The goal will be to do this for all files programmatically:
# iterate through each file: read, left-join, write
for (fn in filenames) {
dat <- fread(fn)
dat <- row1s[dat, on = intersect(names(row1s), names(dat))]
fwrite(dat, file.path(dirname(fn), paste0("augm_", basename(fn))), sep = "\t")
}
newfilenames <- list.files(".", pattern = "^augm_mt.*\\.tsv$", full.names = TRUE)
newfilenames
# [1] "./augm_mt1.tsv" "./augm_mt2.tsv" "./augm_mt3.tsv"
To verify the new files look consistent, look for double-\t (which indicates an empty field, i.e. NA on import):
# double-\t indicates an empty field
lapply(newfilenames, readLines, n = 2)
# [[1]]
# [1] "mpg\tcyl\tdisp\thp" "21\t6\t160\t"
# [[2]]
# [1] "mpg\tcyl\tdisp\thp" "22.8\t4\t\t93"
# [[3]]
# [1] "mpg\tcyl\tdisp\thp" "18.7\t\t360\t175"
Now that we have this, let's go to a command prompt (on Windows, git-bash or WSL's bash will do). We need bash and one of tail or grep. The goal is that we want the column headers from one of these augm_mt files but none of the others.
If we naively concatenate the files, we'll see the header row repeated in the middle of the data ... and when read back into R, this means that every column will be character, likely not what you want:
$ cat augm_mt1.tsv augm_mt2.tsv
mpg cyl disp hp
21 6 160
21 6 160
22.8 4 108
mpg cyl disp hp
22.8 4 93
21.4 6 110
There are three options to side-step this, depending on what tools you have and how much you trust the contents of your data. (I suggest number 1, tail, if you have it, as it is the least ambiguous.)
1. If you have tail, then we can "start at line 2" (skipping line 1) for each file:
$ cat augm_mt2.tsv
mpg cyl disp hp
22.8 4 93
21.4 6 110
$ tail -n +2 augm_mt2.tsv
22.8 4 93
21.4 6 110
If you run this on multiple files, tail prints a filename banner before each set of tailed lines (try it); we'll suppress that in favor of contiguous lines by adding the -q option.
2. If you know that one or more of the column names are never seen in the real contents, then you can do one of:
$ grep -v mpg augm_mt2.tsv
22.8 4 93
21.4 6 110
$ grep -v 'mpg.*cyl.*disp' augm_mt3.tsv
18.7 360 175
18.1 225 105
14.3 360 245
24.4 146.7 62
22.8 140.8 95
19.2 167.6 123
3. More complicated, but should be safer than the "hand-written regex" in number 2:
$ HDR=$(head -n 1 augm_mt2.tsv)
$ grep -F "$HDR" -v augm_mt2.tsv
22.8 4 93
21.4 6 110
(The -F means "fixed string", so no regex matching is attempted. This is safest, since something like a period in a column name could pose a potential risk. Remote, but non-zero.)
Whichever way you choose to go, this is how you combine those three files into one mega-file for reading back into R:
$ { head -n 1 augm_mt1.tsv ; tail -q -n +2 augm_*.tsv ; } > alldata_mt.tsv
The head -n 1 merely outputs the header row, no data, making it possible to use the simple glob augm_*.tsv in the tail command. (Otherwise we would need a way to handle the first file differently from all the rest.)
Now we can read this back into R in one command:
fread("alldata_mt.tsv")
# mpg cyl disp hp
# 1: 21.0 6 160.0 NA
# 2: 21.0 6 160.0 NA
# 3: 22.8 4 108.0 NA
# 4: 22.8 4 NA 93
# 5: 21.4 6 NA 110
# 6: 18.7 NA 360.0 175
# 7: 18.1 NA 225.0 105
# 8: 14.3 NA 360.0 245
# 9: 24.4 NA 146.7 62
# 10: 22.8 NA 140.8 95
# 11: 19.2 NA 167.6 123
And for validation with this micro-data:
all.equal(fread("alldata_mt.tsv"), combined)
# [1] TRUE
An alternative that leaves the intermediate files with no column headers, so we don't have to dance around the header row at all:
for (fn in filenames) {
dat <- fread(fn)
dat <- row1s[dat, on = intersect(names(row1s), names(dat))]
fwrite(dat, file.path(dirname(fn), paste0("augm_", basename(fn))), sep = "\t", col.names = FALSE)
}
Then in bash:
$ cat augm_*tsv > alldata2_mt.tsv
Then in R again,
fread("alldata2_mt.tsv", header = FALSE)
# V1 V2 V3 V4
# 1: 21.0 6 160.0 NA
# 2: 21.0 6 160.0 NA
# 3: 22.8 4 108.0 NA
# 4: 22.8 4 NA 93
# 5: 21.4 6 NA 110
# 6: 18.7 NA 360.0 175
# 7: 18.1 NA 225.0 105
# 8: 14.3 NA 360.0 245
# 9: 24.4 NA 146.7 62
# 10: 22.8 NA 140.8 95
# 11: 19.2 NA 167.6 123
... and you'll have to know the names to reassign them. This method seems like a little less work (it is), but it does leave the remote possibility that the order of the column names is inadvertently changed. The first method above, which preserves column names in all of the files, preempts that potential misstep.
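If you go this route, the names are already available from the row1s table built earlier, e.g. (a small sketch):
dat <- fread("alldata2_mt.tsv", header = FALSE)
setnames(dat, names(row1s))  # reassign the known, consistent column names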
Upvotes: 3