stlba
stlba

Reputation: 767

Mutate if multiple values occur in additional rows together

I have a table of bank account transactions and I want to add a column flagging any transactions which are refunds (both the initial payment and the money coming back in) but for the life of me I can't figure out how to match the rows.

Sample data:

transactions <- tibble(date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-04")),
                       merchant = c("A", "B", "B", "C", "A"),
                       amount = c(10, 5, -5, 9, 10))

Desired output:

# A tibble: 5 x 4
  date       merchant amount refund
  <date>     <chr>     <dbl> <lgl> 
1 2020-01-01 A            10 FALSE 
2 2020-01-02 B             5 TRUE  
3 2020-01-02 B            -5 TRUE  
4 2020-01-03 C             9 FALSE 
5 2020-01-04 A            10 FALSE

At the minute I'm only looking to match payments where the refund is on the same day, but I might change it later to include refunds that occurred any day after the payment. My first thought was to use something like below but it doesn't ensure that the values all occur in the same row. I'd prefer a tidyverse answer.

mutate(refund = if_else(date %in% date & 
                             merchant %in% merchant & 
                             (amount*-1) %in% amount,
                        T, F))

Upvotes: 0

Views: 55

Answers (2)

stlba
stlba

Reputation: 767

I had a brainwave while doing something else, it's a longer process than I was envisioning but posting my working solution here for completeness.

First I added a unique ID for each row (in reality my data already has this)

transactions <- tibble(date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-04")),
                       merchant = c("A", "B", "B", "C", "A"),
                       amount = c(10, 5, -5, 9, 10)) %>%
     mutate(id = seq_len(nrow(.)))

Split the dataframe into two, depending on if money is going out (positive value) or out (negative) of the account. For money in, I mutated the amount column to be positive values instead so they can be matched.

transactions_out <- transactions %>%
     filter(amount > 0)

transactions_in <- transactions %>%
     filter(amount < 0) %>%
     mutate(amount = amount*-1)

Use inner_join() to match rows with the same date, merchant and amount in both rows. Then use pivot_longer() and pull() to save the IDs of these transactions.

In my actual data I've added extra lines here to refine the matches to reduce false positives.

refund_ids <- inner_join(transactions_out, transactions_in,
           by = c("date", "merchant", "amount"),
           suffix = c("_out", "_in")) %>%
     pivot_longer(starts_with("id"),
                  names_to = "type",
                  values_to = "id") %>%
     pull(id)

Finally, I add the new refund column to the original dataframe for the IDs in refund_ids:

transactions <- transactions %>%
     mutate(refund = if_else(id %in% refund_ids, T, F))

Output:

> transactions

# A tibble: 5 x 5
  date       merchant amount    id refund
  <date>     <chr>     <dbl> <int> <lgl> 
1 2020-01-01 A            10     1 FALSE 
2 2020-01-02 B             5     2 TRUE  
3 2020-01-02 B            -5     3 TRUE  
4 2020-01-03 C             9     4 FALSE 
5 2020-01-04 A            10     5 FALSE 

Upvotes: 1

MokeEire
MokeEire

Reputation: 698

One solution could be to group by the columns which identify matching transactions, in this case date and merchant. This would remove the need for the x %in% y syntax you have in the if_else.

Solution

Using case_when() because it gives the flexibility to refine the logic down the line. if_else would be equivalent (or faster?) in the 2 case situation.

transactions %>% 
    group_by(date, merchant) %>% 
    # This logic says:
    # When the sum of the amounts DO NOT equal to 0, it is NOT a refund
    mutate(refund = case_when(sum(amount) != 0 ~ F,
                              T ~ T))

Output

# A tibble: 5 x 4
# Groups:   date, merchant [4]
  date       merchant amount refund
  <date>     <chr>     <dbl> <lgl> 
1 2020-01-01 A            10 FALSE 
2 2020-01-02 B             5 TRUE  
3 2020-01-02 B            -5 TRUE  
4 2020-01-03 C             9 FALSE 
5 2020-01-04 A            10 FALSE 

Upvotes: 1

Related Questions