Reputation: 589
With reference to question: R: stacking up values from rows of a data frame
Hi,
I have a large data.table (1 millions rows). Each row consists in 30 values, and the first four identify a single object. Many rows may have identical first four fields, meaning that they are referred to the same object. The remaining 26 fields may be a number or "NA".
The point is that many rows referred to the same object (or, with identical first four fields) will have the remaining 26 fields differently defined, and usually consisting in lots of NA and only few numeric values.
I want to merge all the multiple references (rows) to a single object into a single row, exacty as dplyr::coalesce (x,y) would do over 2 vectors. I was trying to identify all the subsets into the table and merging them line by line using some loops, but it is very slow. Is there a faster alternative that can handle large data tables (~1 million rows, 30 fields per row) in an efficient way?
Many thanks.
Upvotes: 0
Views: 1611
Reputation: 34621
I'm not a hundred percent certain I've understood your expected output but you can pass any number of vectors to coalesce()
which will result in a single vector of the first non-missing value from left to right. It is fast on large data frames too.
v <- c(1:10, rep(NA, 3))
set.seed(5)
df <- data.frame(replicate(30, sample(v, 1e6, replace = TRUE)))
system.time(coalesce(!!!df))
user system elapsed
0.07 0.08 0.16
Upvotes: 1
Reputation: 69231
What do you want to do when you have multiple valid answers for the same group? You need some sort of aggregation mechanism, i.e. min/max/average/etc. Here's one solution for you leveraging the by
and .SD
operators in data.table
. In my example, I am taking the max
of x1
, x2
, and x3
by id
. This collapses three rows per id
into one.
library(data.table)
dt <- data.table(id = rep(1:4, each = 3), x1 = c(1,NA,2,NA,3,NA,4,NA,5,NA,6,NA), x2 = c(NA,6,NA,5,NA,4,NA,3,NA,2,NA,NA), x3 = c(12,NA,NA,11, NA, NA, 10, NA,NA,NA, NA, NA))
dt[, lapply(.SD, max, na.rm = TRUE),keyby = id]
#> Warning in gmax(x3, na.rm = TRUE): No non-missing values found in at least
#> one group. Returning '-Inf' for such groups to be consistent with base
#> id x1 x2 x3
#> 1: 1 2 6 12
#> 2: 2 3 5 11
#> 3: 3 5 3 10
#> 4: 4 6 2 -Inf
Created on 2019-02-23 by the reprex package (v0.2.1)
Note the warning re: -Inf
which you could clean up after the fact easily.
Upvotes: 2