Max
Max

Reputation: 185

Compare three column pairs and retain occurrences with latest timestamp from fourth column

Suppose I have the following data frame, which is a deduced version of my actual date set. Currently, I am shaping the dataset in tidyverse and all its sub-packages.

set.seed(1111)
productlist <- c("product_1", "product_2")
ITEM <- sample(productlist, 10, replace = TRUE)
QTY <- sample(1:10, 10, replace = TRUE)
TIMESTAMP <- sample(seq(as.POSIXct("2020-09-09 01:00:00"), as.POSIXct("2020-09-09 14:00:00"), by="hour"), 10)
REV_DATE <- sample(seq(as.Date('2020-09-07'), as.Date('2020-09-11'), by="day"), 10, replace = TRUE)
DEL_DATE <- sample(seq(as.Date('2020-09-21'), as.Date('2020-09-24'), by="day"), 10, replace = TRUE)
df.orders <- data.frame(ITEM, QTY, TIMESTAMP, REV_DATE, DEL_DATE)
df.orders <- df.orders %>%
  arrange(REV_DATE, DEL_DATE, ITEM) %>%
  mutate(ID = row_number()) %>%
  select(ID, everything())

Thus, the output would look like this.

+----+-----------+-----+---------------------+------------+------------+
| ID |      ITEM | QTY |           TIMESTAMP | REV_DATE   | DEL_DATE   |
+----+-----------+-----+---------------------+------------+------------+
|  1 | product_1 |   2 | 2020-09-09 11:00:00 | 2020-09-07 | 2020-09-21 |
+----+-----------+-----+---------------------+------------+------------+
|  2 | product_2 |   2 | 2020-09-09 06:00:00 | 2020-09-08 | 2020-09-21 |
+----+-----------+-----+---------------------+------------+------------+
|  3 | product_2 |   7 | 2020-09-09 05:00:00 | 2020-09-08 | 2020-09-21 |
+----+-----------+-----+---------------------+------------+------------+
|  4 | product_2 |   1 | 2020-09-09 01:00:00 | 2020-09-08 | 2020-09-22 |
+----+-----------+-----+---------------------+------------+------------+
|  5 | product_2 |   8 | 2020-09-09 02:00:00 | 2020-09-08 | 2020-09-24 |
+----+-----------+-----+---------------------+------------+------------+
|  6 | product_2 |   1 | 2020-09-09 13:00:00 | 2020-09-09 | 2020-09-23 |
+----+-----------+-----+---------------------+------------+------------+
|  7 | product_1 |   5 | 2020-09-09 12:00:00 | 2020-09-10 | 2020-09-23 |
+----+-----------+-----+---------------------+------------+------------+
|  8 | product_2 |   7 | 2020-09-09 04:00:00 | 2020-09-10 | 2020-09-23 |
+----+-----------+-----+---------------------+------------+------------+
|  9 | product_2 |   5 | 2020-09-09 09:00:00 | 2020-09-10 | 2020-09-24 |
+----+-----------+-----+---------------------+------------+------------+
| 10 | product_2 |   6 | 2020-09-09 14:00:00 | 2020-09-11 | 2020-09-21 |
+----+-----------+-----+---------------------+------------+------------+

My intention is to keep only unique combinations of values in the ITEM, REV_DATE, and DEL_DATE column.

df.orders %>%
  group_by(ITEM, DEL_DATE, REV_DATE) %>%
  filter(n()>1)

# A tibble: 2 x 6
# Groups:   ITEM, DEL_DATE, REV_DATE [1]
     ID ITEM        QTY TIMESTAMP           REV_DATE   DEL_DATE  
  <int> <chr>     <int> <dttm>              <date>     <date>    
1     2 product_2     2 2020-09-09 06:00:00 2020-09-08 2020-09-21
2     3 product_2     7 2020-09-09 05:00:00 2020-09-08 2020-09-21
nrow(unique(df.orders[,c("ITEM", "DEL_DATE", "REV_DATE")]))

[1] 9

As it can be seen that ID 2 and 3 are identical in terms of ITEM, REV_DATE, and DEL_DATE, whereby TIMESTAMP and QTY differ. In the original dataset, there are occurrence where I have 5-6 instances where only the I would like keep only the row with the latest TIMESTAMP, in this case ID=2, since the timestamp indicates the most recent version of the occurrence despite the difference in QTY. So that the data frame looks like this in the end.

+----+-----------+-----+---------------------+------------+------------+
| ID |      ITEM | QTY |           TIMESTAMP | REV_DATE   | DEL_DATE   |
+----+-----------+-----+---------------------+------------+------------+
|  1 | product_1 |   2 | 2020-09-09 11:00:00 | 2020-09-07 | 2020-09-21 |
+----+-----------+-----+---------------------+------------+------------+
|  2 | product_2 |   2 | 2020-09-09 06:00:00 | 2020-09-08 | 2020-09-21 |
+----+-----------+-----+---------------------+------------+------------+
|  3 | product_2 |   1 | 2020-09-09 01:00:00 | 2020-09-08 | 2020-09-22 |
+----+-----------+-----+---------------------+------------+------------+
|  4 | product_2 |   8 | 2020-09-09 02:00:00 | 2020-09-08 | 2020-09-24 |
+----+-----------+-----+---------------------+------------+------------+
|  5 | product_2 |   1 | 2020-09-09 13:00:00 | 2020-09-09 | 2020-09-23 |
+----+-----------+-----+---------------------+------------+------------+
|  6 | product_1 |   5 | 2020-09-09 12:00:00 | 2020-09-10 | 2020-09-23 |
+----+-----------+-----+---------------------+------------+------------+
|  7 | product_2 |   7 | 2020-09-09 04:00:00 | 2020-09-10 | 2020-09-23 |
+----+-----------+-----+---------------------+------------+------------+
|  8 | product_2 |   5 | 2020-09-09 09:00:00 | 2020-09-10 | 2020-09-24 |
+----+-----------+-----+---------------------+------------+------------+
|  9 | product_2 |   6 | 2020-09-09 14:00:00 | 2020-09-11 | 2020-09-21 |
+----+-----------+-----+---------------------+------------+------------+

Upvotes: 1

Views: 69

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 389135

You can select the row with max TIMESTAMP for each ITEM, DEL_DATE and REV_DATE.

library(dplyr)

df.orders %>%
  group_by(ITEM, DEL_DATE, REV_DATE) %>%
  slice(which.max(TIMESTAMP))

#     ID ITEM        QTY TIMESTAMP           REV_DATE   DEL_DATE  
#  <int> <chr>     <int> <dttm>              <date>     <date>    
#1     1 product_1     2 2020-09-09 11:00:00 2020-09-07 2020-09-21
#2     7 product_1     5 2020-09-09 12:00:00 2020-09-10 2020-09-23
#3     2 product_2     2 2020-09-09 06:00:00 2020-09-08 2020-09-21
#4    10 product_2     6 2020-09-09 14:00:00 2020-09-11 2020-09-21
#5     4 product_2     1 2020-09-09 01:00:00 2020-09-08 2020-09-22
#6     6 product_2     1 2020-09-09 13:00:00 2020-09-09 2020-09-23
#7     8 product_2     7 2020-09-09 04:00:00 2020-09-10 2020-09-23
#8     5 product_2     8 2020-09-09 02:00:00 2020-09-08 2020-09-24
#9     9 product_2     5 2020-09-09 09:00:00 2020-09-10 2020-09-24

If there are cases where there are two or more rows in the group with same TIMESTAMP and you want to keep all of them, you can use filter in the following way.

df.orders %>%
  group_by(ITEM, DEL_DATE, REV_DATE) %>%
  filter(TIMESTAMP == max(TIMESTAMP))

Upvotes: 1

TTS
TTS

Reputation: 1928

The distinct function keeps the first row (the ?distinct note on ... reads:

If there are multiple rows for a given combination of inputs, only the first row will be preserved

So if we arrange the dataframe with the most recent timestamps at the top and run distinct we will preserve the records you want.

new_df <- df.orders %>%
  group_by(ITEM, DEL_DATE, REV_DATE) %>%
  arrange(desc(TIMESTAMP)) %>%
  distinct(ITEM, .keep_all = TRUE)
     ID ITEM        QTY TIMESTAMP           REV_DATE   DEL_DATE  
  <int> <chr>     <int> <dttm>              <date>     <date>    
1    10 product_2     6 2020-09-09 14:00:00 2020-09-11 2020-09-21
2     6 product_2     1 2020-09-09 13:00:00 2020-09-09 2020-09-23
3     7 product_1     5 2020-09-09 12:00:00 2020-09-10 2020-09-23
4     1 product_1     2 2020-09-09 11:00:00 2020-09-07 2020-09-21
5     9 product_2     5 2020-09-09 09:00:00 2020-09-10 2020-09-24
6     2 product_2     2 2020-09-09 06:00:00 2020-09-08 2020-09-21
7     8 product_2     7 2020-09-09 04:00:00 2020-09-10 2020-09-23
8     5 product_2     8 2020-09-09 02:00:00 2020-09-08 2020-09-24
9     4 product_2     1 2020-09-09 01:00:00 2020-09-08 2020-09-22

Upvotes: 2

Related Questions