koolmees
koolmees

Reputation: 2783

Rolling join without duplicates using data.table

Imagine the following datasets:

dt.units <- setDT(structure(list(Tienda = c(55596L, 55596L, 55596L, 55596L), `Bulk no.` = c(9021L, 
9021L, 9021L, 9021L), Date = structure(c(18824, 18894, 18927, 
18955), class = "Date"), Units = c(77L, 36L, 37L, 8L)), row.names = c(NA, 
-4L), class = c("data.table", "data.frame")))

   Tienda Bulk no.       Date Units
1:  55596     9021 2021-07-16    77
2:  55596     9021 2021-09-24    36
3:  55596     9021 2021-10-27    37
4:  55596     9021 2021-11-24     8

dt.invoices <- setDT(structure(list(TIENDA = c(55596, 55596, 55596), BULTO = c(9021L, 
9021L, 9021L), Date = structure(c(18828, 18894, 18932), class = "Date"), 
    Invoiced = c(77L, 36L, 37L)), row.names = c(NA, -3L), class = c("data.table", 
"data.frame")))

   TIENDA BULTO       Date Invoiced
1:  55596  9021 2021-07-20       77
2:  55596  9021 2021-09-24       35
3:  55596  9021 2021-11-01       37

I am trying to join them in the following way:

setkeyv(dt.invoices, c("TIENDA", "BULTO", "Date"))
setkeyv(dt.units, c("Tienda", "Bulk no.", "Date"))
dt.data <- dt.invoices[dt.units, roll = "nearest"]

The output will look like this:

   TIENDA BULTO       Date Invoiced Units
1:  55596  9021 2021-07-16       77    77
2:  55596  9021 2021-09-24       35    36
3:  55596  9021 2021-10-27       37    37
4:  55596  9021 2021-11-24       37     8

In the last row I don't want the 37, I want 0/NA. Turning my join around to dt.units[dt.invoices, roll = "nearest"] is not a solution either because then it will remove the 4th record alltogether (and keep the wrong dates).

In short my desired output would look like this:

   TIENDA BULTO       Date Invoiced Units
1:  55596  9021 2021-07-16       77    77
2:  55596  9021 2021-09-24       35    36
3:  55596  9021 2021-10-27       37    37
4:  55596  9021 2021-11-24       NA     8

Upvotes: 1

Views: 82

Answers (1)

Wimpel
Wimpel

Reputation: 27792

# rolling join, also take x.Date to filter out duplicates later in the process
dt.units[, c("Date_2", "Invoiced") := dt.invoices[dt.units, .(x.Date, Invoiced), roll = "nearest"]][]
#    Tienda Bulk no.       Date Units     Date_2 Invoiced
# 1:  55596     9021 2021-07-16    77 2021-07-20       77
# 2:  55596     9021 2021-09-24    36 2021-09-24       36
# 3:  55596     9021 2021-10-27    37 2021-11-01       37
# 4:  55596     9021 2021-11-24     8 2021-11-01       37  !!<-- duplicate Date_2

#set Invoices to NA on duplicated Date_2 rows
dt.units[duplicated(Date_2), Invoiced := NA_integer_]
# remove temp column Date_2
dt.units[, Date_2 := NULL][]
#    Tienda Bulk no.       Date Units Invoiced
# 1:  55596     9021 2021-07-16    77       77
# 2:  55596     9021 2021-09-24    36       36
# 3:  55596     9021 2021-10-27    37       37
# 4:  55596     9021 2021-11-24     8       NA         

update based on comment below

# rolling join, also take x.Date to filter out duplicates later in the process
dt.units[, c("Date_2", "Invoiced") := dt.invoices[dt.units, .(x.Date, Invoiced), roll = "nearest"]][]
#    Tienda Bulk no.       Date Units     Date_2 Invoiced
# 1:  55596     9021 2021-07-16    77 2021-07-20       77
# 2:  55596     9021 2021-09-24    36 2021-09-24       36
# 3:  55596     9021 2021-10-27    37 2021-11-01       37
# 4:  55596     9021 2021-11-24     8 2021-11-01       37

dt.units[ dt.units[, .I[.N > 1 & !Date == min(Date)], 
                   by = .(Tienda, `Bulk no.`, Date_2)]$V1, 
          Invoiced := NA_integer_][, Date_2 := NULL][]
#    Tienda Bulk no.       Date Units Invoiced
# 1:  55596     9021 2021-07-16    77       77
# 2:  55596     9021 2021-09-24    36       36
# 3:  55596     9021 2021-10-27    37       37
# 4:  55596     9021 2021-11-24     8       NA

Upvotes: 1

Related Questions