Aaron
Aaron

Reputation: 109

Create all possible combinations of non-NA values for each group ID

Similar to this question but with an added twist:

Given the following data frame:

txt <- "ID    Col1    Col2    Col3    Col4
        1     6       10      NA      NA
        1     5       10      NA      NA
        1     NA      10      15      20
        2     17      25      NA      NA
        2     13      25      NA      NA
        2     NA      25      21      34
        2     NA      25      35      40"
DF <- read.table(text = txt, header = TRUE)

DF
  ID Col1 Col2 Col3 Col4
1  1    6   10   NA   NA
2  1    5   10   NA   NA
3  1   NA   10   15   20
4  2   17   25   NA   NA
5  2   13   25   NA   NA
6  2   NA   25   21   34
7  2   NA   25   35   40

I wish to collapse the rows by group ID (analogous to Col2 in this example), and when more than 1 combination is present per group, to return all combinations, as so:

  ID Col1 Col2 Col3 Col4
1  1    6   10   15   20
2  1    5   10   15   20
3  2   17   25   21   34
4  2   13   25   21   34
5  2   17   25   35   40
6  2   13   25   35   40

Importantly, down the road I'll need this to work on non-numerical data. Any suggestions? Thanks!

Upvotes: 2

Views: 180

Answers (3)

Uwe
Uwe

Reputation: 42544

In a comment, the OP pointed out:

for my purposes I don't care about the arrangement of elements so long as all combinations of Col1 and (Col3&Col4) per group ID exist in the output

So, if I understand correctly, the question is not about collapsing but about creating all possible combinations of non-NA values of columns Col1, Col2, and combined columns (Col3, Col4) for each ID group.

For this, expand() and nesting() from the package can be used to create the combinations. na.omit() removes all rows containing any NA afterwards:

library(dplyr)
library(tidyr)
DF %>% 
  group_by(ID) %>% 
  expand(Col1, Col2, nesting(Col3, Col4)) %>% 
  na.omit() %>% 
  ungroup()
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     1     6    10    15    20
3     2    13    25    21    34
4     2    13    25    35    40
5     2    17    25    21    34
6     2    17    25    35    40

This approach should work well also with non-numerical data.

Edit 1

Giving a second thought, I was wondering about the peculiar structure of the input dataset, namely the positions of the NAs:

DF
  ID Col1 Col2 Col3 Col4
1  1    6   10   NA   NA
2  1    5   10   NA   NA
3  1   NA   10   15   20
4  2   17   25   NA   NA
5  2   13   25   NA   NA
6  2   NA   25   21   34
7  2   NA   25   35   40

To me, it appears as if DF was constructed from three separate subsets, the first one for Col1

  ID Col1
1  1    6
2  1    5
4  2   17
5  2   13

the second one for Col2

  ID Col2
1  1   10
4  2   25

the third one for Col3 and Col4

  ID Col3 Col4
3  1   15   20
6  2   21   34
7  2   35   40

Based on this observation, here is a different approach which creates all possible combinations of the subsets by a series of merge operations (cartesian joins) of the subsets:

library(magrittr) # piping used her to improve readability
list("Col1", "Col2", c("Col3", "Col4")) %>% 
  lapply(function(x) DF[c("ID", x)] %>% na.omit %>% unique) %>% 
  Reduce(merge, .)
  ID Col1 Col2 Col3 Col4
1  1    6   10   15   20
2  1    5   10   15   20
3  2   17   25   21   34
4  2   17   25   35   40
5  2   13   25   21   34
6  2   13   25   35   40

Here, lapply() creates a list of subsets of the input dataset which is then merged repeatedly using Reduce().

Edit 2:

With version 4.1.0, R has gained a simple native forward pipe syntax |> and \() as a shorthand notation for function(). With this, the code of Edit 1 can be re-written to use only base R (without ):

list("Col1", "Col2", c("Col3", "Col4")) |> 
  lapply(\(x) DF[c("ID", x)] |> na.omit() |> unique()) |>
  (\(z) Reduce(merge, z))()

Upvotes: 2

akrun
akrun

Reputation: 886948

Grouped by 'ID', fill other columns, ungroup to remove the group attribute and keep the distinct rows

library(dplyr)
library(tidyr)
DF %>% 
    group_by(ID) %>% 
    fill(everything(), .direction = 'updown') %>%
    ungroup %>% 
    distinct(.keep_all = TRUE)

Or may also be

DF %>% 
   group_by(ID) %>% 
   mutate(across(everything(), ~ replace(., is.na(.), 
           rep(.[!is.na(.)], length.out = sum(is.na(.))))))

Or based on the comments

DF %>%
   group_by(ID) %>%
   mutate(across(where(~ any(is.na(.))), ~ {
        i1 <- is.na(.)
        ind <- which(i1)
        i2 <- !i1
        if(i1[1] == 1) rep(.[i2], each = n()/sum(i2)) else 
               rep(.[i2], length.out = n())
     })) %>%
   ungroup %>% 
   distinct(.keep_all = TRUE)

-output

# A tibble: 6 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     6    10    15    20
2     1     5    10    15    20
3     2    17    25    21    34
4     2    13    25    21    34
5     2    17    25    35    40
6     2    13    25    35    40

Upvotes: 3

Ronak Shah
Ronak Shah

Reputation: 388817

A data.table option using zoo's na.locf to fill the missing values.

library(zoo)
library(data.table)

setDT(DF)
cols <- grep('Col', names(DF), value = TRUE)
DF[, (cols) := lapply(.SD, function(x) fcoalesce(na.locf(x, na.rm = FALSE), 
                      na.locf(x, na.rm = FALSE, fromLast = TRUE))), ID]
unique(DF)

#   ID Col1 Col2 Col3 Col4
#1:  1    6   10   15   20
#2:  1    5   10   15   20
#3:  2   17   25   21   34
#4:  2   13   25   21   34
#5:  2   13   25   35   40

Upvotes: 3

Related Questions