Jazz
Jazz

Reputation: 135

Removing duplicated data based on each group using R

I have a dataset which contains employee id, name and their bank account information. Some of these employees have duplicate names with either same employee id or different employee id for same employee name. Few of these employees also have same bank account information for same names while some have different bank account numbers under same name. The aim is to find those employees who have same name but different bank account number. Here's a sample of the data:

| Emp_id |   Name  | Bank Account |
|--------|:-------:|-------------:|
| 123    |   Joan  |         6758 |
| 134    |  Karyn  |         1244 |
| 143    | Larry   | 4900         |
| 143    | Larry   | 5201         |
| 235    | Larry   | 5201         |
| 433    | Larry   | 5201         |
| 231    | Larry   | 5201         |
| 120    | Amy     | 7890         |
| 135    | Amy     | 7890         |
| 150    |  Chris  | 1280         |
| 150    | Chris   | 6565         |
| 900    | Cassy   | 1280         |
| 900    | Cassy   | 9873         |

I had to find the employees who were duplicates based on their names which I could do successfully. Once that was done, I had to identify the employees with same name but different bank account no. Right now the issue is that it is not grouping the employees based on name and searching for different bank account. Instead, it is looking for account numbers of different individuals and if it finds it to be same, it removes one of the duplicate values. For example, Chris and Cassy have same bank account number '1280', so it is identifying it to be same and automatically removing one of Chris's record (bank account no 1280 in the output). The output that I'm getting is as shown below:

| Emp_id |  Name | Bank Account |
|--------|:-----:|-------------:|
| 120    |  Amy  |         7890 |
| 900    | Cassy |         1280 |
| 900    | Cassy | 9873         |
| 150    | Chris | 6565         |
| 143    | Larry | 4900         |
| 143    | Larry | 5201         |

This is the code that I have followed:

sample=data.frame(Id=c("123","134","143","143","235","433","231","120","135","150","150","900","900"),
Name=c("Joan","Karyn","Larry","Larry","Larry","Larry","Larry","Amy","Amy","Chris","Chris","Cassy","Cassy"),
Bank_Account=c("6758","1244","4900","5201","5201","5201","5201","7890","7890","1280","6565","1280","9873"))
n_occur <- data.frame(table(sample$Name))
n_occur=n_occur[n_occur$Freq > 1,]
Duplicates=sample[sample$Name %in% n_occur$Var1[n_occur$Freq > 1],]
Duplicates=Duplicates %>% arrange(Duplicates$Name, Duplicates$Name)
Duplicates=Duplicates[!duplicated(Duplicates$Bank_Account),]

The actual output however, should have considered the bank account nos within each name (same name). The output should look something like this:

| Emp_id |   Name  | Bank Account |
|--------|:-------:|-------------:|
| 900    |  Cassy  |1280          |
| 900    |  Cassy  |9873          |
| 150    |  Chris  | 1280         |
| 150    | Chris   | 6565         |
| 143    | Larry   | 4900         |
| 143    | Larry   | 5201         |

Can someone please direct me towards right code?

Upvotes: 2

Views: 94

Answers (2)

bjorn2bewild
bjorn2bewild

Reputation: 1019

Step 1 - Identifying duplicate names:

step_1 <- sample %>%
  arrange(Name) %>%
  mutate(dup = duplicated(Name)) %>%
  filter(Name %in% unique(as.character(Name[dup == T])))

Step 2 - Identifying duplicate accounts for these names:

step_2 <- step_1 %>%
  group_by(Name, Bank_Account) %>%
  mutate(dup = duplicated(Bank_Account)) %>%
  filter(dup == F)

Upvotes: 0

akrun
akrun

Reputation: 886938

We can use n_distinct to filter

library(dplyr)
sample %>% 
    group_by(Name) %>%
    filter(n() > 1) %>%
    group_by(Id, add = TRUE) %>% 
    filter(n_distinct(Bank_Account) > 1) %>%
    arrange(desc(Id))
# A tibble: 6 x 3
# Groups:   Name, Id [3]
#  Id    Name  Bank_Account
#  <fct> <fct> <fct>       
#1 900   Cassy 1280        
#2 900   Cassy 9873        
#3 150   Chris 1280        
#4 150   Chris 6565        
#5 143   Larry 4900        
#6 143   Larry 5201      

Upvotes: 1

Related Questions