Compare three column pairs and retain occurrences with latest timestamp from fourth column

Question

Suppose I have the following data frame, which is a deduced version of my actual date set. Currently, I am shaping the dataset in tidyverse and all its sub-packages.

set.seed(1111)
productlist <- c("product_1", "product_2")
ITEM <- sample(productlist, 10, replace = TRUE)
QTY <- sample(1:10, 10, replace = TRUE)
TIMESTAMP <- sample(seq(as.POSIXct("2020-09-09 01:00:00"), as.POSIXct("2020-09-09 14:00:00"), by="hour"), 10)
REV_DATE <- sample(seq(as.Date('2020-09-07'), as.Date('2020-09-11'), by="day"), 10, replace = TRUE)
DEL_DATE <- sample(seq(as.Date('2020-09-21'), as.Date('2020-09-24'), by="day"), 10, replace = TRUE)
df.orders <- data.frame(ITEM, QTY, TIMESTAMP, REV_DATE, DEL_DATE)
df.orders <- df.orders %>%
  arrange(REV_DATE, DEL_DATE, ITEM) %>%
  mutate(ID = row_number()) %>%
  select(ID, everything())

Thus, the output would look like this.

+----+-----------+-----+---------------------+------------+------------+
| ID |      ITEM | QTY |           TIMESTAMP | REV_DATE   | DEL_DATE   |
+----+-----------+-----+---------------------+------------+------------+
|  1 | product_1 |   2 | 2020-09-09 11:00:00 | 2020-09-07 | 2020-09-21 |
+----+-----------+-----+---------------------+------------+------------+
|  2 | product_2 |   2 | 2020-09-09 06:00:00 | 2020-09-08 | 2020-09-21 |
+----+-----------+-----+---------------------+------------+------------+
|  3 | product_2 |   7 | 2020-09-09 05:00:00 | 2020-09-08 | 2020-09-21 |
+----+-----------+-----+---------------------+------------+------------+
|  4 | product_2 |   1 | 2020-09-09 01:00:00 | 2020-09-08 | 2020-09-22 |
+----+-----------+-----+---------------------+------------+------------+
|  5 | product_2 |   8 | 2020-09-09 02:00:00 | 2020-09-08 | 2020-09-24 |
+----+-----------+-----+---------------------+------------+------------+
|  6 | product_2 |   1 | 2020-09-09 13:00:00 | 2020-09-09 | 2020-09-23 |
+----+-----------+-----+---------------------+------------+------------+
|  7 | product_1 |   5 | 2020-09-09 12:00:00 | 2020-09-10 | 2020-09-23 |
+----+-----------+-----+---------------------+------------+------------+
|  8 | product_2 |   7 | 2020-09-09 04:00:00 | 2020-09-10 | 2020-09-23 |
+----+-----------+-----+---------------------+------------+------------+
|  9 | product_2 |   5 | 2020-09-09 09:00:00 | 2020-09-10 | 2020-09-24 |
+----+-----------+-----+---------------------+------------+------------+
| 10 | product_2 |   6 | 2020-09-09 14:00:00 | 2020-09-11 | 2020-09-21 |
+----+-----------+-----+---------------------+------------+------------+

My intention is to keep only unique combinations of values in the ITEM, REV_DATE, and DEL_DATE column.

df.orders %>%
  group_by(ITEM, DEL_DATE, REV_DATE) %>%
  filter(n()>1)

# A tibble: 2 x 6
# Groups:   ITEM, DEL_DATE, REV_DATE [1]
     ID ITEM        QTY TIMESTAMP           REV_DATE   DEL_DATE  
                                
1     2 product_2     2 2020-09-09 06:00:00 2020-09-08 2020-09-21
2     3 product_2     7 2020-09-09 05:00:00 2020-09-08 2020-09-21

nrow(unique(df.orders[,c("ITEM", "DEL_DATE", "REV_DATE")]))

[1] 9

As it can be seen that ID 2 and 3 are identical in terms of ITEM, REV_DATE, and DEL_DATE, whereby TIMESTAMP and QTY differ. In the original dataset, there are occurrence where I have 5-6 instances where only the I would like keep only the row with the latest TIMESTAMP, in this case ID=2, since the timestamp indicates the most recent version of the occurrence despite the difference in QTY. So that the data frame looks like this in the end.

+----+-----------+-----+---------------------+------------+------------+
| ID |      ITEM | QTY |           TIMESTAMP | REV_DATE   | DEL_DATE   |
+----+-----------+-----+---------------------+------------+------------+
|  1 | product_1 |   2 | 2020-09-09 11:00:00 | 2020-09-07 | 2020-09-21 |
+----+-----------+-----+---------------------+------------+------------+
|  2 | product_2 |   2 | 2020-09-09 06:00:00 | 2020-09-08 | 2020-09-21 |
+----+-----------+-----+---------------------+------------+------------+
|  3 | product_2 |   1 | 2020-09-09 01:00:00 | 2020-09-08 | 2020-09-22 |
+----+-----------+-----+---------------------+------------+------------+
|  4 | product_2 |   8 | 2020-09-09 02:00:00 | 2020-09-08 | 2020-09-24 |
+----+-----------+-----+---------------------+------------+------------+
|  5 | product_2 |   1 | 2020-09-09 13:00:00 | 2020-09-09 | 2020-09-23 |
+----+-----------+-----+---------------------+------------+------------+
|  6 | product_1 |   5 | 2020-09-09 12:00:00 | 2020-09-10 | 2020-09-23 |
+----+-----------+-----+---------------------+------------+------------+
|  7 | product_2 |   7 | 2020-09-09 04:00:00 | 2020-09-10 | 2020-09-23 |
+----+-----------+-----+---------------------+------------+------------+
|  8 | product_2 |   5 | 2020-09-09 09:00:00 | 2020-09-10 | 2020-09-24 |
+----+-----------+-----+---------------------+------------+------------+
|  9 | product_2 |   6 | 2020-09-09 14:00:00 | 2020-09-11 | 2020-09-21 |
+----+-----------+-----+---------------------+------------+------------+

Ronak Shah · Accepted Answer

You can select the row with max TIMESTAMP for each ITEM, DEL_DATE and REV_DATE.

library(dplyr)

df.orders %>%
  group_by(ITEM, DEL_DATE, REV_DATE) %>%
  slice(which.max(TIMESTAMP))

#     ID ITEM        QTY TIMESTAMP           REV_DATE   DEL_DATE  
#                                
#1     1 product_1     2 2020-09-09 11:00:00 2020-09-07 2020-09-21
#2     7 product_1     5 2020-09-09 12:00:00 2020-09-10 2020-09-23
#3     2 product_2     2 2020-09-09 06:00:00 2020-09-08 2020-09-21
#4    10 product_2     6 2020-09-09 14:00:00 2020-09-11 2020-09-21
#5     4 product_2     1 2020-09-09 01:00:00 2020-09-08 2020-09-22
#6     6 product_2     1 2020-09-09 13:00:00 2020-09-09 2020-09-23
#7     8 product_2     7 2020-09-09 04:00:00 2020-09-10 2020-09-23
#8     5 product_2     8 2020-09-09 02:00:00 2020-09-08 2020-09-24
#9     9 product_2     5 2020-09-09 09:00:00 2020-09-10 2020-09-24

If there are cases where there are two or more rows in the group with same TIMESTAMP and you want to keep all of them, you can use filter in the following way.

df.orders %>%
  group_by(ITEM, DEL_DATE, REV_DATE) %>%
  filter(TIMESTAMP == max(TIMESTAMP))

Compare three column pairs and retain occurrences with latest timestamp from fourth column

Answers (2)

Related Questions