Reputation: 13
I have two datasets: dataset1
and dataset2
.
zz <- "id_customer id_order order_date
1 1 2018-10
1 2 2018-11
2 3 2019-05
3 4 2019-06"
dataset1 <- read.table(text=zz, header=TRUE)
yy <- "id_customer order_date
1 2018-10
3 2019-06"
dataset2 <- read.table(text=yy, header=TRUE)
dataset2
is the result of a query where I have two columns: id_customer
and date
(format YYYY-mm).
Those correspond to customers which have a different status than the others in the source dataset (dataset1
), for a specified month.
dataset1
is a list of transactions where I have id_customer
, id_order
and date
(format YYYY-mm as well).
I want to enrich dataset1
with a "flag" column for each line set to 1 if the customer id appears in dataset2
, during the corresponding month.
I have tried something as follows:
dataset$flag <- ifelse(dataset1$id_customer %in% dataset2$id_customer &
dataset1$date == dataset2$date,
"1", "0")
But I get a warning message that says 'longer object length is not a multiple of shorter object length'. I understand that but cannot come up with a solution. Could someone please help?
Upvotes: 1
Views: 1182
Reputation: 422
It is very is to add a corresponding flag in a data.table
way:
# Load library
library(data.table)
# Convert created tables to data.table object
setDT(dataset1)
setDT(dataset2)
# Add {0, 1} to dataset1 if the row can be found in dataset2
dataset1[, flag := 0][dataset2, flag := 1, on = .(id_customer, order_date)]
The result looks as follows:
> dataset1
id_customer id_order order_date flag
1: 1 1 2018-10 1
2: 1 2 2018-11 0
3: 2 3 2019-05 0
4: 3 4 2019-06 1
A bit more manipulations would be needed if you would have the full date/time in the datasets.
Upvotes: 1
Reputation: 34406
You can add a flag to dataset2 then use merge()
, keeping all rows from dataset1. Borrowing Chris' data:
dataset2$flag <- 1
merge(dataset1, dataset2, all.x = TRUE)
ID Date flag
1 1 2018-12 NA
2 1 2019-11 NA
3 2 2018-13 NA
4 2 2019-10 NA
5 2 2019-11 1
6 2 2019-12 NA
7 2 2019-12 NA
8 3 2018-12 1
9 3 2018-12 1
10 4 2018-13 1
Upvotes: 1
Reputation: 21400
EDIT: This seems to work:
Illustrative data:
set.seed(100)
dt1 <- data.frame(
ID = sample(1:4, 10, replace = T),
Date = paste0(sample(2018:2019, 10, replace = T),"-", sample(10:13, 10, replace = T))
)
dt1
ID Date
1 2 2019-12
2 2 2019-12
3 3 2018-12
4 1 2018-12
5 2 2019-11
6 2 2019-10
7 4 2018-13
8 2 2018-13
9 3 2018-12
10 1 2019-11
dt2 <- data.frame(
ID = sample(1:4, 5, replace = T),
Date = paste0(sample(2018:2019, 5, replace = T),"-", sample(10:13, 5, replace = T))
)
dt2
ID Date
1 2 2019-11
2 4 2018-13
3 2 2019-13
4 4 2019-13
5 3 2018-12
SOLUTION:
The solution uses ifelse
to define a condition upon which to set the 'flag' 1
(as specified in the OP). That condition implies a match between dt1
and dt2
; thus we're using match
. A complicating factor is that the condition requires a double match between two columns in each dataframe. Therefore, we use apply
to paste the rows in the two columns together using paste0
and search for matches in these compound strings:
dt1$flag <- ifelse(match(apply(dt1[,1:2], 1, paste0, collapse = " "),
apply(dt2[,1:2], 1, paste0, collapse = " ")), 1, "NA")
RESULT:
dt1
ID Date flag
1 2 2019-12 NA
2 2 2019-12 NA
3 3 2018-12 1
4 1 2018-12 NA
5 2 2019-11 1
6 2 2019-10 NA
7 4 2018-13 1
8 2 2018-13 NA
9 3 2018-12 1
10 1 2019-11 NA
To check the results we can compare them with the results obtained from merge
:
flagged_only <- merge(dt1, dt2)
flagged_only
ID Date
1 2 2019-11
2 3 2018-12
3 3 2018-12
4 4 2018-13
The dataframe flagged_only
contains exactly the same four rows as the ones flagged 1
in dt1
-- voilà!
Upvotes: 0