Gregoire Maltès
Gregoire Maltès

Reputation: 13

Create a "flag" column in a dataset based on a another table in R

I have two datasets: dataset1 and dataset2.

zz <- "id_customer id_order order_date
1 1 2018-10
1 2 2018-11
2 3 2019-05
3 4 2019-06"

dataset1 <- read.table(text=zz, header=TRUE)

yy <- "id_customer order_date
1 2018-10
3 2019-06"

dataset2 <- read.table(text=yy, header=TRUE)

dataset2 is the result of a query where I have two columns: id_customer and date (format YYYY-mm). Those correspond to customers which have a different status than the others in the source dataset (dataset1), for a specified month.

dataset1 is a list of transactions where I have id_customer, id_order and date (format YYYY-mm as well). I want to enrich dataset1 with a "flag" column for each line set to 1 if the customer id appears in dataset2, during the corresponding month.

I have tried something as follows:

dataset$flag <- ifelse(dataset1$id_customer %in% dataset2$id_customer &
                         dataset1$date == dataset2$date,
                       "1", "0")

But I get a warning message that says 'longer object length is not a multiple of shorter object length'. I understand that but cannot come up with a solution. Could someone please help?

Upvotes: 1

Views: 1182

Answers (3)

Serhii
Serhii

Reputation: 422

It is very is to add a corresponding flag in a data.table way:

# Load library
library(data.table)

# Convert created tables to data.table object
setDT(dataset1)
setDT(dataset2)

# Add {0, 1} to dataset1 if the row can be found in dataset2
dataset1[, flag := 0][dataset2, flag := 1, on = .(id_customer, order_date)]

The result looks as follows:

> dataset1
   id_customer id_order order_date flag
1:           1        1    2018-10    1
2:           1        2    2018-11    0
3:           2        3    2019-05    0
4:           3        4    2019-06    1

A bit more manipulations would be needed if you would have the full date/time in the datasets.

Upvotes: 1

lroha
lroha

Reputation: 34406

You can add a flag to dataset2 then use merge(), keeping all rows from dataset1. Borrowing Chris' data:

dataset2$flag <- 1
merge(dataset1, dataset2, all.x = TRUE)

   ID    Date flag
1   1 2018-12   NA
2   1 2019-11   NA
3   2 2018-13   NA
4   2 2019-10   NA
5   2 2019-11    1
6   2 2019-12   NA
7   2 2019-12   NA
8   3 2018-12    1
9   3 2018-12    1
10  4 2018-13    1

Upvotes: 1

Chris Ruehlemann
Chris Ruehlemann

Reputation: 21400

EDIT: This seems to work:

Illustrative data:

set.seed(100)
dt1 <- data.frame(
  ID = sample(1:4, 10, replace = T),
  Date = paste0(sample(2018:2019, 10, replace = T),"-", sample(10:13, 10, replace = T))
)
dt1
   ID    Date
1   2 2019-12
2   2 2019-12
3   3 2018-12
4   1 2018-12
5   2 2019-11
6   2 2019-10
7   4 2018-13
8   2 2018-13
9   3 2018-12
10  1 2019-11

dt2 <- data.frame(
  ID = sample(1:4, 5, replace = T),
  Date = paste0(sample(2018:2019, 5, replace = T),"-", sample(10:13, 5, replace = T))
)
dt2
  ID    Date
1  2 2019-11
2  4 2018-13
3  2 2019-13
4  4 2019-13
5  3 2018-12

SOLUTION:

The solution uses ifelse to define a condition upon which to set the 'flag' 1(as specified in the OP). That condition implies a match between dt1and dt2; thus we're using match. A complicating factor is that the condition requires a double match between two columns in each dataframe. Therefore, we use apply to paste the rows in the two columns together using paste0 and search for matches in these compound strings:

dt1$flag <- ifelse(match(apply(dt1[,1:2], 1, paste0, collapse = " "), 
                         apply(dt2[,1:2], 1, paste0, collapse = " ")), 1, "NA")

RESULT:

dt1
   ID    Date flag
1   2 2019-12   NA
2   2 2019-12   NA
3   3 2018-12    1
4   1 2018-12   NA
5   2 2019-11    1
6   2 2019-10   NA
7   4 2018-13    1
8   2 2018-13   NA
9   3 2018-12    1
10  1 2019-11   NA

To check the results we can compare them with the results obtained from merge:

flagged_only <- merge(dt1, dt2)

flagged_only
  ID    Date
1  2 2019-11
2  3 2018-12
3  3 2018-12
4  4 2018-13

The dataframe flagged_onlycontains exactly the same four rows as the ones flagged 1 in dt1-- voilà!

Upvotes: 0

Related Questions