yanci
yanci

Reputation: 173

R Conditional calculation and row flags within groups

I am currently using R. I have three columns in which I need to identify duplicates.

This is the dataframe that I am using:

df1 <-data.frame(ID_NUMBER = c(990,50000,52000,764000,764000,764000,1420000,1420000,1470000,1470000,2176000,2176000,2401000,2401000,2667000,2667000,3519000,3721000,3721000,4654000,4654000,4685000), 
     CalNumber = c(0,1126.61,1152.24,26900.12,26900.2,26910,50673.98,50674.31,52161.18,52161.73,77743.17,77743.7,85593.97,85594.42,94854.76,94855,124033.46,130973.56,130973.59,162935.73,162935.85,163446.89),

     Date = c('8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013' ,'8/16/2008' ,'8/16/2008' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013' ,'8/8/2013' ,'8/16/2008' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013')) 



   ID_NUMBER    CalNumber        Date
     990      0           8/8/2013 0:00
     50000    1126.61     8/16/2008 0:00
     52000    1152.24     8/8/2013 0:00
     764000   26900.12    8/8/2013 0:00
     764000   26900.2     8/16/2008 0:00
     764000   26910       8/16/2008
    1420000   50673.98    8/16/2008 0:00
    1420000   50674.31    8/8/2013 0:00
    1470000   52161.18    8/16/2008 0:00
    1470000   52161.73    8/8/2013 0:00
    2176000   77743.17    8/16/2008 0:00
    2176000   77743.7     8/8/2013 0:00
    2401000   85593.97    8/16/2008 0:00
    2401000   85594.42    8/8/2013 0:00
    2667000   94854.76    8/16/2008 0:00
    2667000   94855       8/8/2013 0:00
    3519000   124033.46   8/8/2013 0:00
    3721000   130973.56   8/8/2013 0:00
    3721000   130973.59   8/16/2008 0:00
    4654000   162935.73   8/16/2008 0:00
    4654000   162935.85   8/8/2013 0:00
    4685000   163446.89   8/8/2013 0:00

Duplicates are identified as follows: If the ID_NUMBER is not unique, then subtract the record right below for the ID_Number group. If the delta between the next is <= 1 then it is considered a duplicate. The record that takes precedence would be the Maximum date for that group. That would become primary and second would be flagged as a secondary group. My final result set would have two new flags: isNew and isPrimary. If a duplicate does not exist, then it is considered a new first time record. So isNew would be 'Y' and isPrimary would be 'Y' for non duplicate records. I hope the example of what the result would look like below, explains my ramble a little better. I'm too new R so I don't know where to start..so any suggestions or pointers are greatly appreciated.

   ID_NUMBER    CalNumber   Date     CalcDiff       IsNew   isPrimary
     990      0           8/8/2013           --          Y         Y
     50000    1126.61     8/16/2008          --          Y         Y
     52000    1152.24     8/8/2013           --          Y         Y
     764000   26900.12    8/8/2013           --          N         Y
     764000   26900.2     8/16/2008          .08         N         N
     764000   26910           8/16/2008          9.8         Y         Y 
    1420000   50673.98    8/16/2008          --          N         N
    1420000   50674.31    8/8/2013           .33         N         Y
    1470000   52161.18    8/16/2008          --          N         N
    1470000   52161.73    8/8/2013           .55         N         Y
    2176000   77743.17    8/16/2008          --          N         Y
    2176000   77743.7     8/8/2013           .53         N         N 
    2401000   85593.97    8/16/2008          --          N         N
    2401000   85594.42    8/8/2013           .45         N         Y 
    2667000   94854.76    8/16/2008          --          N         N   
    2667000   94855          8/8/2013            .24         N         Y
    3519000   124033.46   8/8/2013           --          Y         Y     
    3721000   130973.56   8/8/2013           --          N         Y
    3721000   130973.59   8/16/2008          .03         N         N  
    4654000   162935.73   8/16/2008          --          Y         Y 
    4654000   162936.85   8/8/2013           1.12        Y         Y  
    4685000   163446.89   8/8/2013            --         Y         Y  

Upvotes: 0

Views: 52

Answers (1)

Dan
Dan

Reputation: 12084

This solution requires dplyr and magrittr (for the compound assignment pipe). First, I define the data frame:

df <- data.frame(ID_NUMBER = c(990,50000,52000,764000,764000,764000,1420000,1420000,1470000,1470000,2176000,2176000,2401000,2401000,2667000,2667000,3519000,3721000,3721000,4654000,4654000,4685000), 
             CalNumber = c(0,1126.61,1152.24,26900.12,26900.2,26910,50673.98,50674.31,52161.18,52161.73,77743.17,77743.7,85593.97,85594.42,94854.76,94855,124033.46,130973.56,130973.59,162935.73,162936.85,163446.89),
             Date = c('8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013' ,'8/16/2008' ,'8/16/2008' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013' ,'8/8/2013' ,'8/16/2008' ,'8/16/2008' ,'8/8/2013' ,'8/8/2013'))

Here, I convert your Date to a date. Then, I group by ID_NUMBER and calculate the differences between neighbouring rows. I then apply your criteria using case_when to determine IsNew. Finally, I group by ID_NUMBER again as well as IsNew and check for the most recent date.

df %<>% 
  mutate(Date = as.Date(Date, "%m/%d/%Y")) %>% 
  group_by(ID_NUMBER) %>% 
  mutate(CalcDiff = c(NA, diff(CalNumber))) %>% 
  mutate(IsNew = case_when(
    n() > 1 & is.na(CalcDiff) & lead(CalcDiff)[1] <=1 ~ "N",
    n() > 1 & is.na(CalcDiff) & lead(CalcDiff)[1] > 1 ~ "Y",
    n() > 1 & CalcDiff <= 1 ~ "N",
    n() > 1 & CalcDiff >1 ~ "Y",
    TRUE ~ "Y"
  )) %>% 
  group_by(ID_NUMBER, IsNew) %>% 
  mutate(IsPrimary = case_when(
    Date == max(Date) & IsNew == "N" ~ "Y",
    Date != max(Date) & IsNew == "N" ~ "N",
    TRUE ~ "Y"
  ))

Results:

# A tibble: 22 x 6
# Groups:   ID_NUMBER, IsNew [14]
# ID_NUMBER CalNumber Date       CalcDiff IsNew IsPrimary
# <dbl>     <dbl> <date>        <dbl> <chr> <chr>    
# 1       990         0  2013-08-08  NA      Y     Y        
# 2      50000     1127. 2008-08-16  NA      Y     Y        
# 3      52000     1152. 2013-08-08  NA      Y     Y        
# 4     764000    26900. 2013-08-08  NA      N     Y        
# 5     764000    26900. 2008-08-16   0.08   N     N        
# 6     764000    26910  2008-08-16   9.80   Y     Y        
# 7    1420000    50674. 2008-08-16  NA      N     N        
# 8    1420000    50674. 2013-08-08   0.330  N     Y        
# 9    1470000    52161. 2008-08-16  NA      N     N        
# 10   1470000    52162. 2013-08-08   0.55   N     Y        
# 11   2176000    77743. 2008-08-16  NA      N     N        
# 12   2176000    77744. 2013-08-08   0.530  N     Y        
# 13   2401000    85594. 2008-08-16  NA      N     N        
# 14   2401000    85594. 2013-08-08   0.450  N     Y        
# 15   2667000    94855. 2008-08-16  NA      N     N        
# 16   2667000    94855  2013-08-08   0.24   N     Y        
# 17   3519000   124033. 2013-08-08  NA      Y     Y        
# 18   3721000   130974. 2013-08-08  NA      N     Y        
# 19   3721000   130974. 2008-08-16   0.0300 N     N        
# 20   4654000   162936. 2008-08-16  NA      Y     Y        
# 21   4654000   162937. 2013-08-08   1.12   Y     Y        
# 22   4685000   163447. 2013-08-08  NA      Y     Y  

Upvotes: 2

Related Questions