MHtaylor
MHtaylor

Reputation: 29

Group by and keep columns with matching pattern

I'm trying to figure out how to group by one variable and keep all other variables that have the same variation pattern. Here's a sample df where gp is my grouping variable:

   V1 V2 V3 V4 V5 V6 gp
1  0  1  0  0  0  0  x
2  0  0  0  0  1  0  x
3  1  0  1  0  1  1  y
4  0  0  0  0  0  1  x

What I'd like to end up with is:

  V1 V3 gp
1  0  0  x
2  0  0  x
3  1  1  y
4  0  0  x

I've tried a bunch of things, but nothing that get's me anywhere close to something useful. My real data will be much larger, but I figure this is a good place to start. In this case, I could do something numeric, but my 'real data' are genomic data, and thus not as easily added as 0s and 1s.

data:

structure(list(V1 = c(0L, 0L, 1L, 0L), V2 = c(1L, 0L, 0L, 0L), 
    V3 = c(0L, 0L, 1L, 0L), V4 = c(0L, 0L, 0L, 0L), V5 = c(0L, 
    1L, 1L, 0L), V6 = c(0L, 0L, 1L, 1L), gp = structure(c(1L, 
    1L, 2L, 1L), .Label = c("x", "y"), class = "factor")), class = "data.frame", row.names = c(NA, 
-4L))

Upvotes: 0

Views: 406

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 388807

We can use duplicated to get columns that are repeated. However, duplicated works in row-wise fashion so we transpose the original dataframe.

t1 <- t(df[-ncol(df)])
df[c(duplicated(t1) | duplicated(t1, fromLast = TRUE), TRUE)]

#  V1 V3 gp
#1  0  0  x
#2  0  0  x
#3  1  1  y
#4  0  0  x

The final TRUE is to select the last column which is gp.

Upvotes: 1

mcskinner
mcskinner

Reputation: 2748

You can do this by finding columns that have a bijection from gp to the values for that column. That is, for every value in gp (x or y) there is exactly one matching value in the variable column e.g. V1 (0 or 1). The reverse is also true, for every value in a variable column like V1 there is exactly one match in gp.

To operate on all columns at once, start by pivoting to a longer form. This will also remove the easy duplicates right away.

uniq <- df %>% pivot_longer(-gp) %>% distinct(name, gp, value)
#    name  gp    value
#    <chr> <fct> <int>
#  1 V1    x         0
#  2 V2    x         1
#  3 V3    x         0
#  4 V4    x         0
#  5 V5    x         0
#  6 V6    x         0
#  7 V2    x         0
#  8 V5    x         1
# ...
# 14 V6    y         1
# 15 V6    x         1

Then you can find the "exactly once" matches in one direction by counting how often each gp value shows up per name. It will be exactly once if it always matches a constant value for the variable column.

match_left <- uniq %>%
  count(name, gp) %>%
  group_by(name) %>%
  filter(max(n) == 1) %>%
  distinct(name)

match_left
# # A tibble: 3 x 1
# # Groups:   name [3]
#   name 
#   <chr>
# 1 V1   
# 2 V3   
# 3 V4   

Do the same thing, but in reverse for the value column.

match_right <- uniq %>%
  count(name, value) %>%
  group_by(name) %>%
  filter(max(n) == 1) %>%
  distinct(name)

match_right
# # A tibble: 2 x 1
# # Groups:   name [2]
#   name 
#   <chr>
# 1 V1   
# 2 V3   

Now that we know which variables to keep, we can merge everything back together and reshape to the wide form.

matches <- df %>%
  mutate(i = row_number()) %>%
  pivot_longer(-c(i, gp)) %>%
  inner_join(match_left, on='name') %>%
  inner_join(match_right, on='name') %>%
  spread(name, value) %>%
  arrange(i) %>%
  select(-i)

matches
# # A tibble: 4 x 3
#   gp       V1    V3
#   <fct> <int> <int>
# 1 x         0     0
# 2 x         0     0
# 3 y         1     1
# 4 x         0     0

Upvotes: 1

Related Questions