marco
marco

Reputation: 589

R: how to "coalesce ()" a whole data table

With reference to question: R: stacking up values from rows of a data frame

Hi,

I have a large data.table (1 millions rows). Each row consists in 30 values, and the first four identify a single object. Many rows may have identical first four fields, meaning that they are referred to the same object. The remaining 26 fields may be a number or "NA".

The point is that many rows referred to the same object (or, with identical first four fields) will have the remaining 26 fields differently defined, and usually consisting in lots of NA and only few numeric values.

I want to merge all the multiple references (rows) to a single object into a single row, exacty as dplyr::coalesce (x,y) would do over 2 vectors. I was trying to identify all the subsets into the table and merging them line by line using some loops, but it is very slow. Is there a faster alternative that can handle large data tables (~1 million rows, 30 fields per row) in an efficient way?

Many thanks.

Upvotes: 0

Views: 1611

Answers (2)

lroha
lroha

Reputation: 34621

I'm not a hundred percent certain I've understood your expected output but you can pass any number of vectors to coalesce() which will result in a single vector of the first non-missing value from left to right. It is fast on large data frames too.

v <- c(1:10, rep(NA, 3))
set.seed(5)
df <- data.frame(replicate(30, sample(v, 1e6, replace = TRUE)))

system.time(coalesce(!!!df))

user  system elapsed 
0.07    0.08    0.16 

Upvotes: 1

Chase
Chase

Reputation: 69231

What do you want to do when you have multiple valid answers for the same group? You need some sort of aggregation mechanism, i.e. min/max/average/etc. Here's one solution for you leveraging the by and .SD operators in data.table. In my example, I am taking the max of x1, x2, and x3 by id. This collapses three rows per id into one.

library(data.table)
dt <- data.table(id = rep(1:4, each = 3), x1 = c(1,NA,2,NA,3,NA,4,NA,5,NA,6,NA), x2 = c(NA,6,NA,5,NA,4,NA,3,NA,2,NA,NA), x3 = c(12,NA,NA,11, NA, NA, 10, NA,NA,NA, NA, NA))
dt[, lapply(.SD, max, na.rm = TRUE),keyby = id]
#> Warning in gmax(x3, na.rm = TRUE): No non-missing values found in at least
#> one group. Returning '-Inf' for such groups to be consistent with base
#>    id x1 x2   x3
#> 1:  1  2  6   12
#> 2:  2  3  5   11
#> 3:  3  5  3   10
#> 4:  4  6  2 -Inf

Created on 2019-02-23 by the reprex package (v0.2.1)

Note the warning re: -Inf which you could clean up after the fact easily.

Upvotes: 2

Related Questions