ZayzayR
ZayzayR

Reputation: 305

Matching values in different datasets by groups in R

I have the following two datasets:

df1 <- data.frame(
  "group" = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5), 
  "numbers" = c(55, 75, 60, 55, 75, 60,  55, 75, 60,  55, 75, 60,  55, 75, 60))
df2 <- data.frame(
  "group" = c(1, 1, 2, 2, 2, 3, 3, 4, 5), 
  "P1" = c(55, NA, 60, 55, 75, 75, 55, 55, 60),
  "P2" = c(55, 75, 55, 60, NA, 75, 55, NA, 60),
  "P3" = c(75, 55, 60, 75, NA, 75, 60, 55, 60))

In df1 each group has the same three numbers (in reality there are around 500 numbers).

I want to check whether the values in the column "numbers" in df1 are contained in the columns P1, P2, and P3 of df2. There are two problems I am stuck with. 1. the values in the numbers column of df1 can occur in different groups in df2 (defined by the group column in df1 and df2). 2. the datasets have different lengths. Is there a way to merge both datasets and have the following dataset:

df3 <- data.frame(
  "group"    = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5), 
  "numbers"  = c(55, 75, 60, 55, 75, 60, 55, 75, 60, 55, 75, 60, 55, 75, 60,),
  "P1new"    = c(1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1),
  "P2new"    = c(1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1),
  "P3new"    = c(1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1))

where P1new (P2new and P3new respectively) contain the value 1 if df2$P1 contains the value in df1$numbers within the correct group (as I said numbers can reoccur in different groups). For example, P3 has the value 75 in group 1 but not in group 5. So in group 1 P3new would have a 1 and in group 5 P3new would have a 0. This question is similar to Find matching values in different datasets by groups in R but I could not adapt the code according to my objectives. So, I would really appreciate any help.

Upvotes: 2

Views: 741

Answers (2)

PaulS
PaulS

Reputation: 25313

Another possible solution:

library(tidyverse)

map_dfc(names(df2[-1]), 
        ~ df1 %>%
          group_by(group) %>%
          mutate(!!.x := +(numbers %in% df2[df2$group == cur_group_id(), .x])) %>%
          ungroup %>%
          select(all_of(.x))) %>%
  bind_cols(df1, .)

#>    group numbers P1 P2 P3
#> 1      1      55  1  1  1
#> 2      1      75  0  1  1
#> 3      1      60  0  0  0
#> 4      2      55  1  1  0
#> 5      2      75  1  0  1
#> 6      2      60  1  1  1
#> 7      3      55  1  1  0
#> 8      3      75  1  1  1
#> 9      3      60  0  0  1
#> 10     4      55  1  0  1
#> 11     4      75  0  0  0
#> 12     4      60  0  0  0
#> 13     5      55  0  0  0
#> 14     5      75  0  0  0
#> 15     5      60  1  1  1

Or, without purrr, another possibility:

library(dplyr)

df1 %>% 
  inner_join(df2) %>% 
  group_by(group) %>% 
  mutate(across(starts_with("P"), ~ +(numbers %in% .x))) %>% 
  ungroup %>% 
  distinct 

Upvotes: 0

Ma&#235;l
Ma&#235;l

Reputation: 51894

Interesting question. Here's a way with dplyr functions:

library(dplyr)
df2 %>% 
  group_by(group) %>% 
  summarise(across(P1:P3, ~ list(unique(na.omit(.x))))) %>% 
  inner_join(df1, .) %>% 
  rowwise() %>% 
  mutate(across(P1:P3, ~ +(numbers %in% .x)))

   group numbers    P1    P2    P3
   <dbl>   <dbl> <int> <int> <int>
 1     1      55     1     1     1
 2     1      75     0     1     1
 3     1      60     0     0     0
 4     2      55     1     1     0
 5     2      75     1     0     1
 6     2      60     1     1     1
 7     3      55     1     1     0
 8     3      75     1     1     1
 9     3      60     0     0     1
10     4      55     1     0     1
11     4      75     0     0     0
12     4      60     0     0     0
13     5      55     0     0     0
14     5      75     0     0     0
15     5      60     1     1     1

Upvotes: 2

Related Questions