Reputation: 185
Suppose I have the following data frame, which is a deduced version of my actual date set. Currently, I am shaping the dataset in tidyverse
and all its sub-packages.
set.seed(1111)
productlist <- c("product_1", "product_2")
ITEM <- sample(productlist, 10, replace = TRUE)
QTY <- sample(1:10, 10, replace = TRUE)
TIMESTAMP <- sample(seq(as.POSIXct("2020-09-09 01:00:00"), as.POSIXct("2020-09-09 14:00:00"), by="hour"), 10)
REV_DATE <- sample(seq(as.Date('2020-09-07'), as.Date('2020-09-11'), by="day"), 10, replace = TRUE)
DEL_DATE <- sample(seq(as.Date('2020-09-21'), as.Date('2020-09-24'), by="day"), 10, replace = TRUE)
df.orders <- data.frame(ITEM, QTY, TIMESTAMP, REV_DATE, DEL_DATE)
df.orders <- df.orders %>%
arrange(REV_DATE, DEL_DATE, ITEM) %>%
mutate(ID = row_number()) %>%
select(ID, everything())
Thus, the output would look like this.
+----+-----------+-----+---------------------+------------+------------+
| ID | ITEM | QTY | TIMESTAMP | REV_DATE | DEL_DATE |
+----+-----------+-----+---------------------+------------+------------+
| 1 | product_1 | 2 | 2020-09-09 11:00:00 | 2020-09-07 | 2020-09-21 |
+----+-----------+-----+---------------------+------------+------------+
| 2 | product_2 | 2 | 2020-09-09 06:00:00 | 2020-09-08 | 2020-09-21 |
+----+-----------+-----+---------------------+------------+------------+
| 3 | product_2 | 7 | 2020-09-09 05:00:00 | 2020-09-08 | 2020-09-21 |
+----+-----------+-----+---------------------+------------+------------+
| 4 | product_2 | 1 | 2020-09-09 01:00:00 | 2020-09-08 | 2020-09-22 |
+----+-----------+-----+---------------------+------------+------------+
| 5 | product_2 | 8 | 2020-09-09 02:00:00 | 2020-09-08 | 2020-09-24 |
+----+-----------+-----+---------------------+------------+------------+
| 6 | product_2 | 1 | 2020-09-09 13:00:00 | 2020-09-09 | 2020-09-23 |
+----+-----------+-----+---------------------+------------+------------+
| 7 | product_1 | 5 | 2020-09-09 12:00:00 | 2020-09-10 | 2020-09-23 |
+----+-----------+-----+---------------------+------------+------------+
| 8 | product_2 | 7 | 2020-09-09 04:00:00 | 2020-09-10 | 2020-09-23 |
+----+-----------+-----+---------------------+------------+------------+
| 9 | product_2 | 5 | 2020-09-09 09:00:00 | 2020-09-10 | 2020-09-24 |
+----+-----------+-----+---------------------+------------+------------+
| 10 | product_2 | 6 | 2020-09-09 14:00:00 | 2020-09-11 | 2020-09-21 |
+----+-----------+-----+---------------------+------------+------------+
My intention is to keep only unique combinations of values in the ITEM
, REV_DATE
, and DEL_DATE
column.
df.orders %>%
group_by(ITEM, DEL_DATE, REV_DATE) %>%
filter(n()>1)
# A tibble: 2 x 6
# Groups: ITEM, DEL_DATE, REV_DATE [1]
ID ITEM QTY TIMESTAMP REV_DATE DEL_DATE
<int> <chr> <int> <dttm> <date> <date>
1 2 product_2 2 2020-09-09 06:00:00 2020-09-08 2020-09-21
2 3 product_2 7 2020-09-09 05:00:00 2020-09-08 2020-09-21
nrow(unique(df.orders[,c("ITEM", "DEL_DATE", "REV_DATE")]))
[1] 9
As it can be seen that ID 2 and 3 are identical in terms of ITEM
, REV_DATE
, and DEL_DATE
, whereby TIMESTAMP
and QTY
differ. In the original dataset, there are occurrence where I have 5-6 instances where only the
I would like keep only the row with the latest TIMESTAMP
, in this case ID=2, since the timestamp indicates the most recent version of the occurrence despite the difference in QTY
. So that the data frame looks like this in the end.
+----+-----------+-----+---------------------+------------+------------+
| ID | ITEM | QTY | TIMESTAMP | REV_DATE | DEL_DATE |
+----+-----------+-----+---------------------+------------+------------+
| 1 | product_1 | 2 | 2020-09-09 11:00:00 | 2020-09-07 | 2020-09-21 |
+----+-----------+-----+---------------------+------------+------------+
| 2 | product_2 | 2 | 2020-09-09 06:00:00 | 2020-09-08 | 2020-09-21 |
+----+-----------+-----+---------------------+------------+------------+
| 3 | product_2 | 1 | 2020-09-09 01:00:00 | 2020-09-08 | 2020-09-22 |
+----+-----------+-----+---------------------+------------+------------+
| 4 | product_2 | 8 | 2020-09-09 02:00:00 | 2020-09-08 | 2020-09-24 |
+----+-----------+-----+---------------------+------------+------------+
| 5 | product_2 | 1 | 2020-09-09 13:00:00 | 2020-09-09 | 2020-09-23 |
+----+-----------+-----+---------------------+------------+------------+
| 6 | product_1 | 5 | 2020-09-09 12:00:00 | 2020-09-10 | 2020-09-23 |
+----+-----------+-----+---------------------+------------+------------+
| 7 | product_2 | 7 | 2020-09-09 04:00:00 | 2020-09-10 | 2020-09-23 |
+----+-----------+-----+---------------------+------------+------------+
| 8 | product_2 | 5 | 2020-09-09 09:00:00 | 2020-09-10 | 2020-09-24 |
+----+-----------+-----+---------------------+------------+------------+
| 9 | product_2 | 6 | 2020-09-09 14:00:00 | 2020-09-11 | 2020-09-21 |
+----+-----------+-----+---------------------+------------+------------+
Upvotes: 1
Views: 69
Reputation: 389135
You can select the row with max
TIMESTAMP
for each ITEM
, DEL_DATE
and REV_DATE
.
library(dplyr)
df.orders %>%
group_by(ITEM, DEL_DATE, REV_DATE) %>%
slice(which.max(TIMESTAMP))
# ID ITEM QTY TIMESTAMP REV_DATE DEL_DATE
# <int> <chr> <int> <dttm> <date> <date>
#1 1 product_1 2 2020-09-09 11:00:00 2020-09-07 2020-09-21
#2 7 product_1 5 2020-09-09 12:00:00 2020-09-10 2020-09-23
#3 2 product_2 2 2020-09-09 06:00:00 2020-09-08 2020-09-21
#4 10 product_2 6 2020-09-09 14:00:00 2020-09-11 2020-09-21
#5 4 product_2 1 2020-09-09 01:00:00 2020-09-08 2020-09-22
#6 6 product_2 1 2020-09-09 13:00:00 2020-09-09 2020-09-23
#7 8 product_2 7 2020-09-09 04:00:00 2020-09-10 2020-09-23
#8 5 product_2 8 2020-09-09 02:00:00 2020-09-08 2020-09-24
#9 9 product_2 5 2020-09-09 09:00:00 2020-09-10 2020-09-24
If there are cases where there are two or more rows in the group with same TIMESTAMP
and you want to keep all of them, you can use filter
in the following way.
df.orders %>%
group_by(ITEM, DEL_DATE, REV_DATE) %>%
filter(TIMESTAMP == max(TIMESTAMP))
Upvotes: 1
Reputation: 1928
The distinct
function keeps the first row (the ?distinct
note on ...
reads:
If there are multiple rows for a given combination of inputs, only the first row will be preserved
So if we arrange
the dataframe with the most recent timestamps at the top and run distinct
we will preserve the records you want.
new_df <- df.orders %>%
group_by(ITEM, DEL_DATE, REV_DATE) %>%
arrange(desc(TIMESTAMP)) %>%
distinct(ITEM, .keep_all = TRUE)
ID ITEM QTY TIMESTAMP REV_DATE DEL_DATE
<int> <chr> <int> <dttm> <date> <date>
1 10 product_2 6 2020-09-09 14:00:00 2020-09-11 2020-09-21
2 6 product_2 1 2020-09-09 13:00:00 2020-09-09 2020-09-23
3 7 product_1 5 2020-09-09 12:00:00 2020-09-10 2020-09-23
4 1 product_1 2 2020-09-09 11:00:00 2020-09-07 2020-09-21
5 9 product_2 5 2020-09-09 09:00:00 2020-09-10 2020-09-24
6 2 product_2 2 2020-09-09 06:00:00 2020-09-08 2020-09-21
7 8 product_2 7 2020-09-09 04:00:00 2020-09-10 2020-09-23
8 5 product_2 8 2020-09-09 02:00:00 2020-09-08 2020-09-24
9 4 product_2 1 2020-09-09 01:00:00 2020-09-08 2020-09-22
Upvotes: 2