Reputation: 29
I'm trying to figure out how to group by one variable and keep all other variables that have the same variation pattern. Here's a sample df where gp is my grouping variable:
V1 V2 V3 V4 V5 V6 gp
1 0 1 0 0 0 0 x
2 0 0 0 0 1 0 x
3 1 0 1 0 1 1 y
4 0 0 0 0 0 1 x
What I'd like to end up with is:
V1 V3 gp
1 0 0 x
2 0 0 x
3 1 1 y
4 0 0 x
I've tried a bunch of things, but nothing that get's me anywhere close to something useful. My real data will be much larger, but I figure this is a good place to start. In this case, I could do something numeric, but my 'real data' are genomic data, and thus not as easily added as 0s and 1s.
data:
structure(list(V1 = c(0L, 0L, 1L, 0L), V2 = c(1L, 0L, 0L, 0L),
V3 = c(0L, 0L, 1L, 0L), V4 = c(0L, 0L, 0L, 0L), V5 = c(0L,
1L, 1L, 0L), V6 = c(0L, 0L, 1L, 1L), gp = structure(c(1L,
1L, 2L, 1L), .Label = c("x", "y"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
Upvotes: 0
Views: 406
Reputation: 388807
We can use duplicated
to get columns that are repeated. However, duplicated
works in row-wise fashion so we transpose the original dataframe.
t1 <- t(df[-ncol(df)])
df[c(duplicated(t1) | duplicated(t1, fromLast = TRUE), TRUE)]
# V1 V3 gp
#1 0 0 x
#2 0 0 x
#3 1 1 y
#4 0 0 x
The final TRUE
is to select the last column which is gp
.
Upvotes: 1
Reputation: 2748
You can do this by finding columns that have a bijection from gp
to the values for that column. That is, for every value in gp
(x
or y
) there is exactly one matching value in the variable column e.g. V1
(0
or 1
). The reverse is also true, for every value in a variable column like V1
there is exactly one match in gp
.
To operate on all columns at once, start by pivoting to a longer form. This will also remove the easy duplicates right away.
uniq <- df %>% pivot_longer(-gp) %>% distinct(name, gp, value)
# name gp value
# <chr> <fct> <int>
# 1 V1 x 0
# 2 V2 x 1
# 3 V3 x 0
# 4 V4 x 0
# 5 V5 x 0
# 6 V6 x 0
# 7 V2 x 0
# 8 V5 x 1
# ...
# 14 V6 y 1
# 15 V6 x 1
Then you can find the "exactly once" matches in one direction by counting how often each gp
value shows up per name
. It will be exactly once if it always matches a constant value for the variable column.
match_left <- uniq %>%
count(name, gp) %>%
group_by(name) %>%
filter(max(n) == 1) %>%
distinct(name)
match_left
# # A tibble: 3 x 1
# # Groups: name [3]
# name
# <chr>
# 1 V1
# 2 V3
# 3 V4
Do the same thing, but in reverse for the value column.
match_right <- uniq %>%
count(name, value) %>%
group_by(name) %>%
filter(max(n) == 1) %>%
distinct(name)
match_right
# # A tibble: 2 x 1
# # Groups: name [2]
# name
# <chr>
# 1 V1
# 2 V3
Now that we know which variables to keep, we can merge everything back together and reshape to the wide form.
matches <- df %>%
mutate(i = row_number()) %>%
pivot_longer(-c(i, gp)) %>%
inner_join(match_left, on='name') %>%
inner_join(match_right, on='name') %>%
spread(name, value) %>%
arrange(i) %>%
select(-i)
matches
# # A tibble: 4 x 3
# gp V1 V3
# <fct> <int> <int>
# 1 x 0 0
# 2 x 0 0
# 3 y 1 1
# 4 x 0 0
Upvotes: 1