Identify sequences in alphanumeric strings in R

I am attempting to create a flag for when transaction IDs are sequential. For reasons that I will not get into here, these can be a red flag. The problem that I am having is that the IDs are not standardized. Some can be numeric only and others are alphanumeric. I want to identify cases where a vendor's transactions are sequential for X number (how many in a row TBD) of transactions. The data will be grouped by vendor and I'd like to extract all rows and columns for instances where a sequence is present. A much simplified example below:

df <- read.table(text=
"Vendor        'Transaction ID'
 ACME               1
 ACME               2
 ACME               3
 JDOE               A1
 JDOE               A6
 JDOE               A10
 XYZ                B12
 XYZ                B13
 XYZ                B14", header=TRUE)

In this instance I would want to extract the Vendor and Transaction ID for all ACME and all XYZ rows since they are in a sequence. I've done some research but not found an approach that seems to fit what I'm trying to do.

Upvotes: 1

Answers (3)

Mark

Reputation: 12548

My earlier version used the stringr and as.numeric, but I really like the parse_number() function from Andrew's answer, so I am ruthlessly stealing it ^_^

library(tidyverse)

df %>%
  filter(Transaction.ID %>% 
          parse_number() %>%
          diff() %>%
          {all(. == 1)}, .by = Vendor)

Upvotes: 1

jpsmith

Reputation: 17646

Packages are more elegant, but for good measure, a base R approach would be to use tapply to create a boolean after looking for sequential numbers by groups, then [indexing]:

xx <- tapply(df$Transaction.ID, df$Vendor, function(x) 
  length(rle(diff(as.numeric(gsub("[A-Za-z]", '\\1', x))))$lengths) == 1)

# ACME  JDOE   XYZ 
# TRUE FALSE  TRUE 

df[df$Vendor %in% names(xx[xx]),]

#  Vendor Transaction.ID
#1   ACME              1
#2   ACME              2
#3   ACME              3
#7    XYZ            B12
#8    XYZ            B13
#9    XYZ            B14

Upvotes: 1

AndrewGB

Reputation: 16866

One option is to convert the Transaction_ID to a number using parse_number, then you can create a grouping column for consecutive values within each group. Then, you can filter depending on how many consecutive numbers you want to have. Here, I use greater than 1, but you can update the filter statement depending on your desired criteria. Then, I ungroup and remove the extra columns - Transaction_ID2 and grp.

library(dplyr)
library(readr)

df %>%
  mutate(Transaction_ID2 = parse_number(Transaction_ID)) %>%
  group_by(Vendor, grp = cumsum(c(1, diff(Transaction_ID2) != 1))) %>%
  filter(n() > 1) %>% 
  ungroup %>% 
  select(-c(Transaction_ID2, grp))

Output

  Vendor Transaction_ID
  <chr>  <chr>         
1 ACME   1             
2 ACME   2             
3 ACME   3             
4 XYZ    B12           
5 XYZ    B13           
6 XYZ    B14

Upvotes: 4

Identify sequences in alphanumeric strings in R

Answers (3)

Related Questions