Reputation: 173
I am currently using R. I have three columns in which I need to identify duplicates.
This is the dataframe that I am using:
df1 <-data.frame(ID_NUMBER = c(990,50000,52000,764000,764000,764000,1420000,1420000,1470000,1470000,2176000,2176000,2401000,2401000,2667000,2667000,3519000,3721000,3721000,4654000,4654000,4685000),
CalNumber = c(0,1126.61,1152.24,26900.12,26900.2,26910,50673.98,50674.31,52161.18,52161.73,77743.17,77743.7,85593.97,85594.42,94854.76,94855,124033.46,130973.56,130973.59,162935.73,162935.85,163446.89),
Date = c('8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013' ,'8/16/2008' ,'8/16/2008' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013' ,'8/8/2013' ,'8/16/2008' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013'))
ID_NUMBER CalNumber Date
990 0 8/8/2013 0:00
50000 1126.61 8/16/2008 0:00
52000 1152.24 8/8/2013 0:00
764000 26900.12 8/8/2013 0:00
764000 26900.2 8/16/2008 0:00
764000 26910 8/16/2008
1420000 50673.98 8/16/2008 0:00
1420000 50674.31 8/8/2013 0:00
1470000 52161.18 8/16/2008 0:00
1470000 52161.73 8/8/2013 0:00
2176000 77743.17 8/16/2008 0:00
2176000 77743.7 8/8/2013 0:00
2401000 85593.97 8/16/2008 0:00
2401000 85594.42 8/8/2013 0:00
2667000 94854.76 8/16/2008 0:00
2667000 94855 8/8/2013 0:00
3519000 124033.46 8/8/2013 0:00
3721000 130973.56 8/8/2013 0:00
3721000 130973.59 8/16/2008 0:00
4654000 162935.73 8/16/2008 0:00
4654000 162935.85 8/8/2013 0:00
4685000 163446.89 8/8/2013 0:00
Duplicates are identified as follows: If the ID_NUMBER is not unique, then subtract the record right below for the ID_Number group. If the delta between the next is <= 1 then it is considered a duplicate. The record that takes precedence would be the Maximum date for that group. That would become primary and second would be flagged as a secondary group. My final result set would have two new flags: isNew and isPrimary. If a duplicate does not exist, then it is considered a new first time record. So isNew would be 'Y' and isPrimary would be 'Y' for non duplicate records. I hope the example of what the result would look like below, explains my ramble a little better. I'm too new R so I don't know where to start..so any suggestions or pointers are greatly appreciated.
ID_NUMBER CalNumber Date CalcDiff IsNew isPrimary
990 0 8/8/2013 -- Y Y
50000 1126.61 8/16/2008 -- Y Y
52000 1152.24 8/8/2013 -- Y Y
764000 26900.12 8/8/2013 -- N Y
764000 26900.2 8/16/2008 .08 N N
764000 26910 8/16/2008 9.8 Y Y
1420000 50673.98 8/16/2008 -- N N
1420000 50674.31 8/8/2013 .33 N Y
1470000 52161.18 8/16/2008 -- N N
1470000 52161.73 8/8/2013 .55 N Y
2176000 77743.17 8/16/2008 -- N Y
2176000 77743.7 8/8/2013 .53 N N
2401000 85593.97 8/16/2008 -- N N
2401000 85594.42 8/8/2013 .45 N Y
2667000 94854.76 8/16/2008 -- N N
2667000 94855 8/8/2013 .24 N Y
3519000 124033.46 8/8/2013 -- Y Y
3721000 130973.56 8/8/2013 -- N Y
3721000 130973.59 8/16/2008 .03 N N
4654000 162935.73 8/16/2008 -- Y Y
4654000 162936.85 8/8/2013 1.12 Y Y
4685000 163446.89 8/8/2013 -- Y Y
Upvotes: 0
Views: 52
Reputation: 12084
This solution requires dplyr
and magrittr
(for the compound assignment pipe). First, I define the data frame:
df <- data.frame(ID_NUMBER = c(990,50000,52000,764000,764000,764000,1420000,1420000,1470000,1470000,2176000,2176000,2401000,2401000,2667000,2667000,3519000,3721000,3721000,4654000,4654000,4685000),
CalNumber = c(0,1126.61,1152.24,26900.12,26900.2,26910,50673.98,50674.31,52161.18,52161.73,77743.17,77743.7,85593.97,85594.42,94854.76,94855,124033.46,130973.56,130973.59,162935.73,162936.85,163446.89),
Date = c('8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013' ,'8/16/2008' ,'8/16/2008' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013' ,'8/8/2013' ,'8/16/2008' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013'))
Here, I convert your Date
to a date. Then, I group by ID_NUMBER
and calculate the differences between neighbouring rows. I then apply your criteria using case_when
to determine IsNew
. Finally, I group by ID_NUMBER
again as well as IsNew
and check for the most recent date.
df %<>%
mutate(Date = as.Date(Date, "%m/%d/%Y")) %>%
group_by(ID_NUMBER) %>%
mutate(CalcDiff = c(NA, diff(CalNumber))) %>%
mutate(IsNew = case_when(
n() > 1 & is.na(CalcDiff) & lead(CalcDiff)[1] <=1 ~ "N",
n() > 1 & is.na(CalcDiff) & lead(CalcDiff)[1] > 1 ~ "Y",
n() > 1 & CalcDiff <= 1 ~ "N",
n() > 1 & CalcDiff >1 ~ "Y",
TRUE ~ "Y"
)) %>%
group_by(ID_NUMBER, IsNew) %>%
mutate(IsPrimary = case_when(
Date == max(Date) & IsNew == "N" ~ "Y",
Date != max(Date) & IsNew == "N" ~ "N",
TRUE ~ "Y"
))
Results:
# A tibble: 22 x 6
# Groups: ID_NUMBER, IsNew [14]
# ID_NUMBER CalNumber Date CalcDiff IsNew IsPrimary
# <dbl> <dbl> <date> <dbl> <chr> <chr>
# 1 990 0 2013-08-08 NA Y Y
# 2 50000 1127. 2008-08-16 NA Y Y
# 3 52000 1152. 2013-08-08 NA Y Y
# 4 764000 26900. 2013-08-08 NA N Y
# 5 764000 26900. 2008-08-16 0.08 N N
# 6 764000 26910 2008-08-16 9.80 Y Y
# 7 1420000 50674. 2008-08-16 NA N N
# 8 1420000 50674. 2013-08-08 0.330 N Y
# 9 1470000 52161. 2008-08-16 NA N N
# 10 1470000 52162. 2013-08-08 0.55 N Y
# 11 2176000 77743. 2008-08-16 NA N N
# 12 2176000 77744. 2013-08-08 0.530 N Y
# 13 2401000 85594. 2008-08-16 NA N N
# 14 2401000 85594. 2013-08-08 0.450 N Y
# 15 2667000 94855. 2008-08-16 NA N N
# 16 2667000 94855 2013-08-08 0.24 N Y
# 17 3519000 124033. 2013-08-08 NA Y Y
# 18 3721000 130974. 2013-08-08 NA N Y
# 19 3721000 130974. 2008-08-16 0.0300 N N
# 20 4654000 162936. 2008-08-16 NA Y Y
# 21 4654000 162937. 2013-08-08 1.12 Y Y
# 22 4685000 163447. 2013-08-08 NA Y Y
Upvotes: 2