Reputation: 767
I have a table of bank account transactions and I want to add a column flagging any transactions which are refunds (both the initial payment and the money coming back in) but for the life of me I can't figure out how to match the rows.
Sample data:
transactions <- tibble(date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-04")),
merchant = c("A", "B", "B", "C", "A"),
amount = c(10, 5, -5, 9, 10))
Desired output:
# A tibble: 5 x 4
date merchant amount refund
<date> <chr> <dbl> <lgl>
1 2020-01-01 A 10 FALSE
2 2020-01-02 B 5 TRUE
3 2020-01-02 B -5 TRUE
4 2020-01-03 C 9 FALSE
5 2020-01-04 A 10 FALSE
At the minute I'm only looking to match payments where the refund is on the same day, but I might change it later to include refunds that occurred any day after the payment. My first thought was to use something like below but it doesn't ensure that the values all occur in the same row. I'd prefer a tidyverse answer.
mutate(refund = if_else(date %in% date &
merchant %in% merchant &
(amount*-1) %in% amount,
T, F))
Upvotes: 0
Views: 55
Reputation: 767
I had a brainwave while doing something else, it's a longer process than I was envisioning but posting my working solution here for completeness.
First I added a unique ID for each row (in reality my data already has this)
transactions <- tibble(date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03", "2020-01-04")),
merchant = c("A", "B", "B", "C", "A"),
amount = c(10, 5, -5, 9, 10)) %>%
mutate(id = seq_len(nrow(.)))
Split the dataframe into two, depending on if money is going out (positive value) or out (negative) of the account. For money in, I mutated the amount column to be positive values instead so they can be matched.
transactions_out <- transactions %>%
filter(amount > 0)
transactions_in <- transactions %>%
filter(amount < 0) %>%
mutate(amount = amount*-1)
Use inner_join()
to match rows with the same date, merchant and amount in both rows. Then use pivot_longer()
and pull()
to save the IDs of these transactions.
In my actual data I've added extra lines here to refine the matches to reduce false positives.
refund_ids <- inner_join(transactions_out, transactions_in,
by = c("date", "merchant", "amount"),
suffix = c("_out", "_in")) %>%
pivot_longer(starts_with("id"),
names_to = "type",
values_to = "id") %>%
pull(id)
Finally, I add the new refund column to the original dataframe for the IDs in refund_ids
:
transactions <- transactions %>%
mutate(refund = if_else(id %in% refund_ids, T, F))
Output:
> transactions
# A tibble: 5 x 5
date merchant amount id refund
<date> <chr> <dbl> <int> <lgl>
1 2020-01-01 A 10 1 FALSE
2 2020-01-02 B 5 2 TRUE
3 2020-01-02 B -5 3 TRUE
4 2020-01-03 C 9 4 FALSE
5 2020-01-04 A 10 5 FALSE
Upvotes: 1
Reputation: 698
One solution could be to group by the columns which identify matching transactions, in this case date
and merchant
. This would remove the need for the x %in% y
syntax you have in the if_else
.
Using case_when()
because it gives the flexibility to refine the logic down the line. if_else
would be equivalent (or faster?) in the 2 case situation.
transactions %>%
group_by(date, merchant) %>%
# This logic says:
# When the sum of the amounts DO NOT equal to 0, it is NOT a refund
mutate(refund = case_when(sum(amount) != 0 ~ F,
T ~ T))
# A tibble: 5 x 4
# Groups: date, merchant [4]
date merchant amount refund
<date> <chr> <dbl> <lgl>
1 2020-01-01 A 10 FALSE
2 2020-01-02 B 5 TRUE
3 2020-01-02 B -5 TRUE
4 2020-01-03 C 9 FALSE
5 2020-01-04 A 10 FALSE
Upvotes: 1