Reputation: 33
I have a data frame with duplicated IDs which can look like this
+-----+------+------------------+
| ID + Name + other columns....|
+-----+------+------------------+
| 1 + AAA + |
| 1 + BBB + |
| 2 + ABA + |
| 2 + ACA + |
| 2 + CCC + |
| 3 + DDD + |
| 4 + EEE + |
| 4 + EEE + |
| 4 + FFF + |
| . + + |
+-----+------+------------------+
I want to find which duplicated ID that have not the same value in the Name column. I can find the duplicated IDs but I want to compare the column "Name" in the same data frame based on equal IDs.
Upvotes: 0
Views: 92
Reputation: 1937
This will give you a new column where TRUE are the the rows with duplicated IDs and same Name:
df=tibble(ID=c(1,1,2,2,2,3,4,4,4),Name=c("AAA","BBB","ABA","ACA","CCC","DDD","EEE","EEE","FFF"))
df0=df%>%group_by(ID)%>%mutate(x=duplicated(Name))
Your current df gives True only in row 8 (ID==4 & Name==EEE)
ID Name x
<dbl> <chr> <lgl>
1 1.00 AAA F
2 1.00 BBB F
3 2.00 ABA F
4 2.00 ACA F
5 2.00 CCC F
6 3.00 ABA F
7 4.00 EEE F
8 4.00 EEE T
9 4.00 FFF F
If you change your df to have other matching Names for the same ID ('ABA'):
df=tibble(ID=c(1,1,2,2,2,3,4,4,4),Name=c("AAA","BBB","ABA","ABA","CCC","DDD","EEE","EEE","FFF"))
you will get more Trues:
ID Name x
<dbl> <chr> <lgl>
1 1.00 AAA F
2 1.00 BBB F
3 2.00 ABA F
4 2.00 ABA T
5 2.00 CCC F
6 3.00 DDD F
7 4.00 EEE F
8 4.00 EEE T
9 4.00 FFF F
However if the same name appears for a difference ID:
df=tibble(ID=c(1,1,2,2,2,3,4,4,4),Name=c("AAA","BBB","ABA","ACA","CCC","ABA","EEE","EEE","FFF"))
there will be no new match:
ID Name x
<dbl> <chr> <lgl>
1 1.00 AAA F
2 1.00 BBB F
3 2.00 ABA F
4 2.00 ACA F
5 2.00 CCC F
6 3.00 ABA F
7 4.00 EEE F
8 4.00 EEE T
9 4.00 FFF F
Upvotes: 0
Reputation: 6768
Here is a solution using dplyr
.
library(dplyr)
df %>%
group_by(ID) %>%
filter(n() > 1) %>% # select only duplicated rows
mutate(Unique_Name = n_distinct(Name)) %>% # number of distinct Name values
filter(Unique_Name != 1) # select rows that have not unique Name values
# or just
df %>%
group_by(ID) %>%
filter(n() > 1) %>% # select only duplicated rows
filter(n_distinct(Name) != 1) # select rows that have not unique Name values
# Data
df <- structure(list(ID = c(1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L), Name = structure(c(1L,
4L, 2L, 3L, 5L, 6L, 7L, 7L), .Label = c("AAA", "ABA", "ACA",
"BBB", "CCC", "DDD", "EEE"), class = "factor")), .Names = c("ID",
"Name"), class = "data.frame", row.names = c(NA, -8L))
Upvotes: 1
Reputation: 887831
We can try
names(which(rowSums(table(df1[1:2]) != 0) == 1))
It is not clear whether the logic is to find the IDs that have all unique
'Name' or not. If that is the case
library(dplyr)
df1 %>%
group_by(ID) %>%
filter(n_distinct(Name)== n()) %>%
pull(ID) %>%
unique
Upvotes: 0