zanmatt
zanmatt

Reputation: 35

Assign an ID based on multiple criteria

I am new to R and am trying to put together a script to automate a now very manual task of triangulating different reports.

In my job I receive reports from different sources, which I need to triangulate and aggregate if needed. To simplify (and anonymise) my example let's say that I get reports about sales made by different merchants in a market. This data includes "Observer", "Seller", "Buyer", and "Date of sale".

Example:

market <- data.frame(observer=c("Tom", "Fred", "Hank", "Tom"), 
seller=c("A", "A", "B", "A"),
buyer=c("X", "X", "Y", "X"),
date_sale=c("2017/01/01", "2017/01/03", "2017/01/04", "2017/01/05"))

Now, some of this data might overlap, so I need to make sure that I know if a transaction has been reported already across merchants in a similar time period (+/- 7 days) and assign the same ID to it (so later I can merge the two). If, however, the same observer reports the same transaction again a short time later I can assume that in that case it's a separate one.

In my example, I can see that Tom and Fred both reported a purchase from A to X within 2 days of each other, while Tom reported a second one in the same time period. So ideally R should give the same ID to the first two transactions and a separate one to the third.

The result should be:

market <- data.frame(observer=c("Tom", "Fred", "Hank", "Tom"), 
seller=c("A", "A", "B", "A"),
buyer=c("X", "X", "Y", "X"),
date_sale=c("2017/01/01", "2017/01/03", "2017/01/04", "2017/01/05"),
id=c(1, 1, 2, 3))

I have tried with getanID from the splitstackshape package, but I cannot manage to find out how to give a parameter of "within +/- 7 days of an earlier transaction". I am open to any suggestions, thank you very much!

Upvotes: 2

Views: 98

Answers (1)

Artem Sokolov
Artem Sokolov

Reputation: 13691

For completeness sake, I added another data point to your data.frame, that is more than 7 days away. I also converted your dates to the correct class to simplify date arithmetic:

market <- data.frame(observer=c("Tom", "Fred", "Hank", "Tom", "Joe"),
                 seller=c("A", "A", "B", "A", "A"),
                 buyer=c("X", "X", "Y", "X", "X"),
                 date_sale=as.Date(c("2017/01/01", "2017/01/03",
                        "2017/01/04","2017/01/05", "2017/01/09")) )

The first step you'll want to do is bin your data into 7-day bins:

library( dplyr )    # We'll make extensive use of this package
m1 <- market %>% mutate( date_bin = as.integer((date_sale - min(date_sale)) / 7) )
#   observer seller buyer  date_sale date_bin
# 1      Tom      A     X 2017-01-01        0
# 2     Fred      A     X 2017-01-03        0
# 3     Hank      B     Y 2017-01-04        0
# 4      Tom      A     X 2017-01-05        0
# 5      Joe      A     X 2017-01-09        1  

The final id is going to be a product of two "sub-IDs": the outer ID which comes from grouping your data by date_bin, seller and buyer (i.e., what are all the possible versions of transactions that can happen within a 7-day period), and the inner ID which enumerates duplicate transactions made by the same observer within each group.

The two IDs can be computed as follows:

i1 <- m1 %>% group_by( date_bin, seller, buyer ) %>% group_indices()
m2 <- m1 %>% mutate( outID = i1 ) %>% group_by( outID, observer ) %>%
        mutate( inID = 1:n() )
#   observer seller buyer  date_sale date_bin outID inID
# 1      Tom      A     X 2017-01-01        0     1    1
# 2     Fred      A     X 2017-01-03        0     1    1
# 3     Hank      B     Y 2017-01-04        0     2    1
# 4      Tom      A     X 2017-01-05        0     1    2
# 5      Joe      A     X 2017-01-09        1     3    1 

Finally, we create the final id from all the unique pairs of outID and inID:

market %>% mutate( id = group_by( m2, outID, inID ) %>% group_indices() )
#   observer seller buyer  date_sale id
# 1      Tom      A     X 2017-01-01  1
# 2     Fred      A     X 2017-01-03  1
# 3     Hank      B     Y 2017-01-04  3
# 4      Tom      A     X 2017-01-05  2
# 5      Joe      A     X 2017-01-09  4

Note that the indices are not exactly in the same order as what you requested in your question, but since these are arbitrary integers, you can reassign them to desired values without loss of generality.

Upvotes: 1

Related Questions