Reputation: 109
Similar to this question but with an added twist:
Given the following data frame:
txt <- "ID Col1 Col2 Col3 Col4
1 6 10 NA NA
1 5 10 NA NA
1 NA 10 15 20
2 17 25 NA NA
2 13 25 NA NA
2 NA 25 21 34
2 NA 25 35 40"
DF <- read.table(text = txt, header = TRUE)
DF
ID Col1 Col2 Col3 Col4
1 1 6 10 NA NA
2 1 5 10 NA NA
3 1 NA 10 15 20
4 2 17 25 NA NA
5 2 13 25 NA NA
6 2 NA 25 21 34
7 2 NA 25 35 40
I wish to collapse the rows by group ID (analogous to Col2 in this example), and when more than 1 combination is present per group, to return all combinations, as so:
ID Col1 Col2 Col3 Col4
1 1 6 10 15 20
2 1 5 10 15 20
3 2 17 25 21 34
4 2 13 25 21 34
5 2 17 25 35 40
6 2 13 25 35 40
Importantly, down the road I'll need this to work on non-numerical data. Any suggestions? Thanks!
Upvotes: 2
Views: 180
Reputation: 42544
In a comment, the OP pointed out:
for my purposes I don't care about the arrangement of elements so long as all combinations of Col1 and (Col3&Col4) per group ID exist in the output
So, if I understand correctly, the question is not about collapsing but about creating all possible combinations of the non-NA values of columns Col1, Col2, and the combined columns (Col3, Col4) for each ID group.
For this, expand() and nesting() from the tidyr package can be used to create the combinations. na.omit() then removes all rows containing any NA:
library(dplyr)
library(tidyr)
DF %>%
group_by(ID) %>%
expand(Col1, Col2, nesting(Col3, Col4)) %>%
na.omit() %>%
ungroup()
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     1     6    10    15    20
3     2    13    25    21    34
4     2    13    25    35    40
5     2    17    25    21    34
6     2    17    25    35    40
This approach should work well also with non-numerical data.
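To illustrate the non-numerical case, here is a small sketch with a hypothetical character-valued data frame (DF2 is made up for this illustration, not part of the question):

```r
library(dplyr)
library(tidyr)

# Hypothetical character-valued analogue of the question's DF
DF2 <- data.frame(
  ID   = c(1, 1, 1),
  Col1 = c("a", "b", NA),
  Col2 = c("x", "x", "x"),
  Col3 = c(NA, NA, "p"),
  Col4 = c(NA, NA, "q")
)

# Same pattern as above: expand the combinations, then drop rows with NA
DF2 %>%
  group_by(ID) %>%
  expand(Col1, Col2, nesting(Col3, Col4)) %>%
  na.omit() %>%
  ungroup()
# 2 rows: ("a", "x", "p", "q") and ("b", "x", "p", "q")
```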
On second thought, I was wondering about the peculiar structure of the input dataset, namely the positions of the NAs:
DF
  ID Col1 Col2 Col3 Col4
1  1    6   10   NA   NA
2  1    5   10   NA   NA
3  1   NA   10   15   20
4  2   17   25   NA   NA
5  2   13   25   NA   NA
6  2   NA   25   21   34
7  2   NA   25   35   40
To me, it appears as if DF was constructed from three separate subsets: the first one for Col1
  ID Col1
1  1    6
2  1    5
4  2   17
5  2   13
the second one for Col2
  ID Col2
1  1   10
4  2   25
and the third one for Col3 and Col4
  ID Col3 Col4
3  1   15   20
6  2   21   34
7  2   35   40
Based on this observation, here is a different approach which creates all possible combinations by a series of merge operations (Cartesian joins) on the subsets:
library(magrittr) # piping used here to improve readability
list("Col1", "Col2", c("Col3", "Col4")) %>%
lapply(function(x) DF[c("ID", x)] %>% na.omit %>% unique) %>%
Reduce(merge, .)
  ID Col1 Col2 Col3 Col4
1  1    6   10   15   20
2  1    5   10   15   20
3  2   17   25   21   34
4  2   17   25   35   40
5  2   13   25   21   34
6  2   13   25   35   40
Here, lapply() creates a list of subsets of the input dataset, which is then merged repeatedly using Reduce().
With version 4.1.0, R gained a simple native forward pipe syntax |> and \() as a shorthand notation for function(). With this, the code above can be re-written to use only base R (without magrittr):
list("Col1", "Col2", c("Col3", "Col4")) |>
lapply(\(x) DF[c("ID", x)] |> na.omit() |> unique()) |>
(\(z) Reduce(merge, z))()
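Since R 4.2.0, the underscore placeholder _ can replace the lambda wrapper in the last step, provided it is passed to a named argument. This is merely a syntactic variant of the pipeline above (assuming R >= 4.2.0):

```r
txt <- "ID Col1 Col2 Col3 Col4
1 6 10 NA NA
1 5 10 NA NA
1 NA 10 15 20
2 17 25 NA NA
2 13 25 NA NA
2 NA 25 21 34
2 NA 25 35 40"
DF <- read.table(text = txt, header = TRUE)

# _ stands for the piped-in list of subsets; it must be a named argument
list("Col1", "Col2", c("Col3", "Col4")) |>
  lapply(\(x) DF[c("ID", x)] |> na.omit() |> unique()) |>
  Reduce(f = merge, x = _)
```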
Upvotes: 2
Reputation: 886948
Grouped by 'ID', fill the other columns in both directions, ungroup to remove the group attribute, and keep the distinct rows:
library(dplyr)
library(tidyr)
DF %>%
group_by(ID) %>%
fill(everything(), .direction = 'updown') %>%
ungroup %>%
distinct(.keep_all = TRUE)
Or it may also be
DF %>%
group_by(ID) %>%
mutate(across(everything(), ~ replace(., is.na(.),
rep(.[!is.na(.)], length.out = sum(is.na(.))))))
Or based on the comments
DF %>%
group_by(ID) %>%
mutate(across(where(~ any(is.na(.))), ~ {
i1 <- is.na(.)
ind <- which(i1)
i2 <- !i1
if(i1[1] == 1) rep(.[i2], each = n()/sum(i2)) else
rep(.[i2], length.out = n())
})) %>%
ungroup %>%
distinct(.keep_all = TRUE)
-output
# A tibble: 6 x 5
ID Col1 Col2 Col3 Col4
<int> <int> <int> <int> <int>
1 1 6 10 15 20
2 1 5 10 15 20
3 2 17 25 21 34
4 2 13 25 21 34
5 2 17 25 35 40
6 2 13 25 35 40
Upvotes: 3
Reputation: 388817
A data.table option using zoo's na.locf to fill the missing values:
library(zoo)
library(data.table)
setDT(DF)
cols <- grep('Col', names(DF), value = TRUE)
DF[, (cols) := lapply(.SD, function(x) fcoalesce(na.locf(x, na.rm = FALSE),
na.locf(x, na.rm = FALSE, fromLast = TRUE))), ID]
unique(DF)
# ID Col1 Col2 Col3 Col4
#1: 1 6 10 15 20
#2: 1 5 10 15 20
#3: 2 17 25 21 34
#4: 2 13 25 21 34
#5: 2 13 25 35 40
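As a side note, recent data.table versions (1.12.4+) ship nafill(), which could stand in for zoo::na.locf here and drop the zoo dependency. Note that nafill() only supports numeric-like types, so this sketch would not carry over to character data:

```r
library(data.table)

txt <- "ID Col1 Col2 Col3 Col4
1 6 10 NA NA
1 5 10 NA NA
1 NA 10 15 20
2 17 25 NA NA
2 13 25 NA NA
2 NA 25 21 34
2 NA 25 35 40"
DF <- read.table(text = txt, header = TRUE)

DT <- as.data.table(DF)
cols <- grep('Col', names(DT), value = TRUE)

# Fill each column down (locf), then up (nocb), within each ID group
DT[, (cols) := lapply(.SD, function(x)
  fcoalesce(nafill(x, type = "locf"), nafill(x, type = "nocb"))), by = ID]
unique(DT)
```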
Upvotes: 3