Reputation: 137
I am attempting to create a flag for when transaction IDs are sequential. For reasons that I will not get into here, these can be a red flag. The problem that I am having is that the IDs are not standardized. Some can be numeric only and others are alphanumeric. I want to identify cases where a vendor's transactions are sequential for X number (how many in a row TBD) of transactions. The data will be grouped by vendor and I'd like to extract all rows and columns for instances where a sequence is present. A much simplified example below:
df <- read.table(text=
"Vendor 'Transaction ID'
ACME 1
ACME 2
ACME 3
JDOE A1
JDOE A6
JDOE A10
XYZ B12
XYZ B13
XYZ B14", header=TRUE)
In this instance I would want to extract the Vendor and Transaction ID for all ACME and all XYZ rows since they are in a sequence. I've done some research but not found an approach that seems to fit what I'm trying to do.
Upvotes: 1
Views: 94
Reputation: 12548
My earlier version used the stringr and as.numeric, but I really like the parse_number()
function from Andrew's answer, so I am ruthlessly stealing it ^_^
library(tidyverse)
df %>%
filter(Transaction.ID %>%
parse_number() %>%
diff() %>%
{all(. == 1)}, .by = Vendor)
Upvotes: 1
Reputation: 17646
Packages are more elegant, but for good measure, a base R approach would be to use tapply
to create a boolean after looking for sequential numbers by groups, then [indexing]:
xx <- tapply(df$Transaction.ID, df$Vendor, function(x)
length(rle(diff(as.numeric(gsub("[A-Za-z]", '\\1', x))))$lengths) == 1)
# ACME JDOE XYZ
# TRUE FALSE TRUE
df[df$Vendor %in% names(xx[xx]),]
# Vendor Transaction.ID
#1 ACME 1
#2 ACME 2
#3 ACME 3
#7 XYZ B12
#8 XYZ B13
#9 XYZ B14
Upvotes: 1
Reputation: 16866
One option is to convert the Transaction_ID
to a number using parse_number
, then you can create a grouping column for consecutive values within each group. Then, you can filter
depending on how many consecutive numbers you want to have. Here, I use greater than 1, but you can update the filter
statement depending on your desired criteria. Then, I ungroup and remove the extra columns - Transaction_ID2
and grp
.
library(dplyr)
library(readr)
df %>%
mutate(Transaction_ID2 = parse_number(Transaction_ID)) %>%
group_by(Vendor, grp = cumsum(c(1, diff(Transaction_ID2) != 1))) %>%
filter(n() > 1) %>%
ungroup %>%
select(-c(Transaction_ID2, grp))
Output
Vendor Transaction_ID
<chr> <chr>
1 ACME 1
2 ACME 2
3 ACME 3
4 XYZ B12
5 XYZ B13
6 XYZ B14
Upvotes: 4