Tony23
Tony23

Reputation: 85

Find and selecting duplicates rows and comparing their dates

I have a data frame like below with 96,000 lines and varying amounts of the same ID. I want select the rows with the same ID and then choose the Sales amount from the row with the earliest date.

   ID     Date1      Date2      Date3     Sales  
 ------ ---------- ---------- ---------- ------- 
  3351   7/18/18    1/8/2017   9/7/2016   $240   
  3351              9/15/14               $670   
  3351   4/5/2017   9/7/16                       
  8222              6/6/2013   2/5/2008   $943    

In this case for ID 3351 we would pick $670, because Date2 in the second dup is in 2014.

for(i in length(data))
{
  if(duplicated(dat17[i,1]) == TRUE)
    {
       pmin(dat17[1,7:9], dat17[2,7:9])
    }
}

I am trying to use a for loop to find the duplicates and compare, but I'm not sure how to use multiple rows given by the duplicated() function. pmin() only works for vectors and I need to use the data from the data frame.

Upvotes: 0

Views: 474

Answers (3)

akrun
akrun

Reputation: 886968

We don't need to compare a logical expression to TRUE/FALSE. it is already a logical output in base R (No external packages used)

i1 <- !duplicated(dat17[[1]])

Also, before doing the comparison, make sure the 'Date' columns are converted to Date class

dat17[2:4] <- lapply(dat17[2:4], as.Date, format = "%m/%d/%y")
cbind(dat17[i1,], new = do.call(pmin, c(dat17[2:4][i1,], na.rm = TRUE)))
#  ID      Date1      Date2      Date3 Sales        new
#1 3351 2018-07-18 2020-01-08 2020-09-07  $240 2018-07-18
#4 8222       <NA> 2020-06-06 2020-02-05  $943 2020-02-05

Note that the for loop is just looping through one value length(data) returns a single element. Also, not sure what is 'data' here when the next line should 'dat17'

data

dat17 <- structure(list(ID = c(3351L, 3351L, 3351L, 8222L), Date1 = c("7/18/18", 
NA, "4/5/2017", NA), Date2 = c("1/8/2017", "9/15/14", "9/7/16", 
"6/6/2013"), Date3 = c("9/7/2016", NA, NA, "2/5/2008"), Sales = c("$240", 
"$670", NA, "$943")), class = "data.frame", row.names = c(NA, 
-4L))

Upvotes: 1

cardinal40
cardinal40

Reputation: 1263

Here's one option:

library(tidyverse) 
library(lubridate)

df <-
  tibble(
    id = c(3351, 3351, 8222, 8222),
    date1 = c("2018-07-18", NA_character_, "2017-01-03", "2016-03-02"),
    date2 = c("2016-03-04", "2015-03-02", NA_character_, "2017-04-05"),
    sales = c(240, 670, NA_integer_, 300)
  )

df
# A tibble: 4 x 4
     id date1      date2      sales
  <dbl> <chr>      <chr>      <dbl>
1  3351 2018-07-18 2016-03-04   240
2  3351 NA         2015-03-02   670
3  8222 2017-01-03 NA            NA
4  8222 2016-03-02 2017-04-05   300

The first step is to gather the various date columns into a single column:

df %>% 
  gather(key = date_col, value = date, date1:date2)

# A tibble: 8 x 4
     id sales date_col date      
  <dbl> <dbl> <chr>    <chr>     
1  3351   240 date1    2018-07-18
2  3351   670 date1    NA        
3  8222    NA date1    2017-01-03
4  8222   300 date1    2016-03-02
5  3351   240 date2    2016-03-04
6  3351   670 date2    2015-03-02
7  8222    NA date2    NA        
8  8222   300 date2    2017-04-05

Then you can group by the ID and filter to find the earliest date.

df %>% 
  gather(key = date_col, value = date, date1:date2) %>% 
  filter(!is.na(date), !is.na(sales)) %>% 
  mutate(date = ymd(date)) %>% 
  group_by(id) %>% 
  filter(date == min(date)) %>% 
  ungroup()

# A tibble: 2 x 4
     id sales date_col date      
  <dbl> <dbl> <chr>    <date>    
1  8222   300 date1    2016-03-02
2  3351   670 date2    2015-03-02

Upvotes: 1

Adam Sampson
Adam Sampson

Reputation: 2011

One method is to use the dplyr package which makes this sort of thing easier.

library(dplyr)

df <- read.table(text = "ID     Date1      Date2      Date3     Sales  
  3351   7/18/18    1/8/2017   9/7/2016   $240   
  3351   NA         9/15/14    NA         $670   
  3351   4/5/2017   9/7/16     NA         NA         
  8222   NA         6/6/2013   2/5/2008   $943  ",
  stringsAsFactors = FALSE,
  header = TRUE) %>%
  # make sure the date variables are in date format for sorting to work properly
  mutate_at(vars(starts_with("Date")),lubridate::mdy)

df %>%
  # calculate the minimum date for each row using pmin
  mutate(min_date = pmin(Date1,Date2,Date3,na.rm = TRUE)) %>%
  # arrange from lowest date to highest date
  arrange(min_date) %>%
  # for each ID
  group_by(ID) %>%
  # keep the first one
  slice(1) %>%
  # ungroup to allow future calculations
  ungroup()

# A tibble: 2 x 6
#      ID Date1      Date2      Date3      Sales min_date  
#   <int> <date>     <date>     <date>     <chr> <date>    
# 1  3351 NA         2014-09-15 NA         $670  2014-09-15
# 2  8222 NA         2013-06-06 2008-02-05 $943  2008-02-05

Upvotes: 4

Related Questions