kh.b
kh.b

Reputation: 33

compare data frame column's

I have a data frame with duplicated IDs which can look like this

+-----+------+------------------+
| ID  + Name + other columns....|
+-----+------+------------------+
|  1  + AAA  +                  |
|  1  + BBB  +                  |   
|  2  + ABA  +                  |
|  2  + ACA  +                  |
|  2  + CCC  +                  |
|  3  + DDD  +                  |
|  4  + EEE  +                  |
|  4  + EEE  +                  |
|  4  + FFF  +                  |
|  .  +      +                  |
+-----+------+------------------+

I want to find which duplicated ID that have not the same value in the Name column. I can find the duplicated IDs but I want to compare the column "Name" in the same data frame based on equal IDs.

Upvotes: 0

Views: 92

Answers (3)

Antonios
Antonios

Reputation: 1937

This will give you a new column where TRUE are the the rows with duplicated IDs and same Name:

df=tibble(ID=c(1,1,2,2,2,3,4,4,4),Name=c("AAA","BBB","ABA","ACA","CCC","DDD","EEE","EEE","FFF"))
df0=df%>%group_by(ID)%>%mutate(x=duplicated(Name))

Your current df gives True only in row 8 (ID==4 & Name==EEE)

 ID Name  x    
  <dbl> <chr> <lgl>
1  1.00 AAA   F    
2  1.00 BBB   F    
3  2.00 ABA   F    
4  2.00 ACA   F    
5  2.00 CCC   F    
6  3.00 ABA   F    
7  4.00 EEE   F    
8  4.00 EEE   T    
9  4.00 FFF   F    

If you change your df to have other matching Names for the same ID ('ABA'):

df=tibble(ID=c(1,1,2,2,2,3,4,4,4),Name=c("AAA","BBB","ABA","ABA","CCC","DDD","EEE","EEE","FFF"))

you will get more Trues:

ID Name  x    
  <dbl> <chr> <lgl>
1  1.00 AAA   F    
2  1.00 BBB   F    
3  2.00 ABA   F    
4  2.00 ABA   T    
5  2.00 CCC   F    
6  3.00 DDD   F    
7  4.00 EEE   F    
8  4.00 EEE   T    
9  4.00 FFF   F    

However if the same name appears for a difference ID:

df=tibble(ID=c(1,1,2,2,2,3,4,4,4),Name=c("AAA","BBB","ABA","ACA","CCC","ABA","EEE","EEE","FFF"))

there will be no new match:

ID Name  x    
  <dbl> <chr> <lgl>
1  1.00 AAA   F    
2  1.00 BBB   F    
3  2.00 ABA   F    
4  2.00 ACA   F    
5  2.00 CCC   F    
6  3.00 ABA   F    
7  4.00 EEE   F    
8  4.00 EEE   T    
9  4.00 FFF   F    

Upvotes: 0

nghauran
nghauran

Reputation: 6768

Here is a solution using dplyr.

library(dplyr)
    df %>%
            group_by(ID) %>% 
            filter(n() > 1) %>% # select only duplicated rows
            mutate(Unique_Name = n_distinct(Name)) %>%  # number of distinct Name values
            filter(Unique_Name != 1)  # select rows that have not unique Name values
    # or just
    df %>%
            group_by(ID) %>% 
            filter(n() > 1) %>% # select only duplicated rows
            filter(n_distinct(Name) != 1)  # select rows that have not unique Name values

# Data
df <- structure(list(ID = c(1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L), Name = structure(c(1L, 
4L, 2L, 3L, 5L, 6L, 7L, 7L), .Label = c("AAA", "ABA", "ACA", 
"BBB", "CCC", "DDD", "EEE"), class = "factor")), .Names = c("ID", 
"Name"), class = "data.frame", row.names = c(NA, -8L))

Upvotes: 1

akrun
akrun

Reputation: 887831

We can try

names(which(rowSums(table(df1[1:2]) != 0) == 1))

It is not clear whether the logic is to find the IDs that have all unique 'Name' or not. If that is the case

library(dplyr)
df1 %>%
  group_by(ID) %>% 
  filter(n_distinct(Name)== n()) %>%
  pull(ID) %>%
  unique

Upvotes: 0

Related Questions