R: how to "coalesce ()" a whole data table

Question

With reference to question: R: stacking up values from rows of a data frame

Hi,

I have a large data.table (1 millions rows). Each row consists in 30 values, and the first four identify a single object. Many rows may have identical first four fields, meaning that they are referred to the same object. The remaining 26 fields may be a number or "NA".

The point is that many rows referred to the same object (or, with identical first four fields) will have the remaining 26 fields differently defined, and usually consisting in lots of NA and only few numeric values.

I want to merge all the multiple references (rows) to a single object into a single row, exacty as dplyr::coalesce (x,y) would do over 2 vectors. I was trying to identify all the subsets into the table and merging them line by line using some loops, but it is very slow. Is there a faster alternative that can handle large data tables (~1 million rows, 30 fields per row) in an efficient way?

Many thanks.

Chase · Accepted Answer

What do you want to do when you have multiple valid answers for the same group? You need some sort of aggregation mechanism, i.e. min/max/average/etc. Here's one solution for you leveraging the by and .SD operators in data.table. In my example, I am taking the max of x1, x2, and x3 by id. This collapses three rows per id into one.

library(data.table)
dt <- data.table(id = rep(1:4, each = 3), x1 = c(1,NA,2,NA,3,NA,4,NA,5,NA,6,NA), x2 = c(NA,6,NA,5,NA,4,NA,3,NA,2,NA,NA), x3 = c(12,NA,NA,11, NA, NA, 10, NA,NA,NA, NA, NA))
dt[, lapply(.SD, max, na.rm = TRUE),keyby = id]
#> Warning in gmax(x3, na.rm = TRUE): No non-missing values found in at least
#> one group. Returning '-Inf' for such groups to be consistent with base
#>    id x1 x2   x3
#> 1:  1  2  6   12
#> 2:  2  3  5   11
#> 3:  3  5  3   10
#> 4:  4  6  2 -Inf

^{Created on 2019-02-23 by the reprex package (v0.2.1)}

Note the warning re: -Inf which you could clean up after the fact easily.

R: how to "coalesce ()" a whole data table

Answers (2)

Related Questions

R: how to &quot;coalesce ()&quot; a whole data table

Answers (2)

Related Questions

R: how to "coalesce ()" a whole data table