Reputation: 33
I am working in R with some sequential data. Specifically, I have a vector of integers in which certain sequences appear several times. I am trying to write code that can identify how many different sequences appear.
Currently, I am doing it manually: I predefine the patterns that exist and apply a function that counts their occurrences.
I first use RMySQL to run the query, whose result is stored in the variable product_process_history_joined. Then I extract the column of interest into the variable data. Next, I define the patterns my function should look for, and finally I apply the function that counts the number of occurrences.
The code:
product_process_history_joined<-dbGetQuery(con,"SELECT *
FROM product, process_history
WHERE product.idproduct = process_history.product_idproduct")
data<-product_process_history_joined$process_types_idprocess_types
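# First pattern to look for
pat <- c(1,2,4,5,6)
# TRUE at every index where the window of length(pat) equals pat;
# the "+1" makes sure the last possible window is checked as well
x <- sapply(1:(length(data)-length(pat)+1), function(i) all(data[i:(i+length(pat)-1)] == pat))
route<-data[which(x)]
countR<-length(route)
# Second pattern, counted the same way
pat1 <- c(1,2,4,5,7,9,7,7,2,5,6,10)
x <- sapply(1:(length(data)-length(pat1)+1), function(i) all(data[i:(i+length(pat1)-1)] == pat1))
route1<-data[which(x)]
countR1<-length(route1)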
The dataset that is produced and stored in the data variable looks like this:
[1] 1 4 5 6 1 4 5 6 1 4 5 6 1 4 5 6 1 4 5 6 1 4 5 6 1 4 5 6 1 4 5 6 1 4 5
[36] 6 1 4 5 6 1 4 5 6 1 4 5 6 1 4 5 6 1 4 5 6 1 4 5 6 1 4 5 6 1 4 5 6 1 4
[71] 5 6 1 4 5 6 1 4 5 6 1 4 5 6 1 2 4 5 6 10 1 2 4 5 7 9 7 7 2 5 6 10 1 2 4
[106] 5 6 10 1 2 4 5 6 10 1 2 4 8 1 2 3 5 7 8 1 2 3 5 6 1 2 3 5 6 1 2 4 5 6 10
This is just a subset of the data. I use around 12 different patterns. The results for the first 2 patterns in the given dataset are 21 for pat and 1 for pat1.
Upvotes: 1
Views: 271
Reputation: 132706
There is no reason for regexing. You could use rollapply:
original_data <- c(1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5,6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 2, 4, 5, 6, 10, 1, 2, 4, 5, 7, 9, 7, 7, 2, 5, 6, 10, 1, 2, 4, 5, 6, 10, 1, 2, 4, 5, 6, 10, 1, 2, 4, 8, 1, 2, 3, 5, 7, 8, 1, 2, 3, 5, 6, 1, 2, 3, 5, 6, 1, 2, 4, 5, 6, 10)
pattern2 <- c(1, 4, 5, 6)
library(zoo)
sum(
rollapply(
original_data,
width = length(pattern2),
FUN = function(x, pattern) all(x == pattern),
pattern = pattern2
)
)
#[1] 21
Faster solutions are possible if necessary, but this offers good readability.
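As a rough sketch of what a faster variant could look like, the window comparison can be done in base R with one vectorised pass per pattern element instead of one function call per window. The helper name count_pattern is made up for this illustration, the speed gain is an expectation rather than something benchmarked here, and the data is the same vector as above written compactly.

```r
# Same values as original_data above, written compactly
original_data <- c(rep(c(1, 4, 5, 6), 21),
                   1, 2, 4, 5, 6, 10,
                   1, 2, 4, 5, 7, 9, 7, 7, 2, 5, 6, 10,
                   1, 2, 4, 5, 6, 10,
                   1, 2, 4, 5, 6, 10,
                   1, 2, 4, 8,
                   1, 2, 3, 5, 7, 8,
                   1, 2, 3, 5, 6,
                   1, 2, 3, 5, 6,
                   1, 2, 4, 5, 6, 10)

# Hypothetical helper: number of contiguous occurrences of `pattern` in `x`
count_pattern <- function(x, pattern) {
  n <- length(x)
  k <- length(pattern)
  if (k == 0L || k > n) return(0L)
  starts <- seq_len(n - k + 1L)        # every valid window start, including the last
  hit <- rep(TRUE, length(starts))
  for (j in seq_len(k)) {
    # compare the j-th element of all windows with pattern[j] at once
    hit <- hit & x[starts + j - 1L] == pattern[j]
  }
  sum(hit)
}

# Count several patterns in one go
patterns <- list(
  pat  = c(1, 2, 4, 5, 6),
  pat1 = c(1, 2, 4, 5, 7, 9, 7, 7, 2, 5, 6, 10),
  pat2 = c(1, 4, 5, 6)
)
counts <- vapply(patterns, count_pattern, integer(1), x = original_data)
counts
```

Because the patterns live in a named list, adding the remaining patterns only means adding list entries. Note that the sketch assumes the data contains no NA values.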
Edit
This extracts all different sequences that start with a 1:
x <- split(original_data, cumsum(original_data == 1))
unique(x)
res <- vapply(unique(x), function(x, y) sum(vapply(y, FUN = identical, y = x, FUN.VALUE = TRUE)), y = x, FUN.VALUE = 1L)
Res <- data.frame(n = res,
seq = vapply(unique(x), paste, collapse = ",", FUN.VALUE = "a"))
# n seq
#1 21 1,4,5,6
#2 4 1,2,4,5,6,10
#3 1 1,2,4,5,7,9,7,7,2,5,6,10
#4 1 1,2,4,8
#5 1 1,2,3,5,7,8
#6 2 1,2,3,5,6
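A shorter way to get the same counts, sketched here under the same assumption that every sequence starts with a 1, is to paste each chunk into a key and let table do the counting. The variable names chunks, keys and counts are only illustrative.

```r
# Same values as original_data above, written compactly
original_data <- c(rep(c(1, 4, 5, 6), 21),
                   1, 2, 4, 5, 6, 10,
                   1, 2, 4, 5, 7, 9, 7, 7, 2, 5, 6, 10,
                   1, 2, 4, 5, 6, 10,
                   1, 2, 4, 5, 6, 10,
                   1, 2, 4, 8,
                   1, 2, 3, 5, 7, 8,
                   1, 2, 3, 5, 6,
                   1, 2, 3, 5, 6,
                   1, 2, 4, 5, 6, 10)

# Start a new chunk at every 1
chunks <- split(original_data, cumsum(original_data == 1))

# One string key per chunk, e.g. "1,4,5,6"
keys <- vapply(chunks, paste, character(1), collapse = ",")

# Frequency of each distinct sequence, most frequent first
counts <- table(keys)
sort(counts, decreasing = TRUE)
```

If the data does not begin with a 1, the leading values end up in a chunk of their own, so it is worth checking the first key.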
Upvotes: 4
Reputation: 1434
This is definitely not the best way to do the job, but you could decide to treat your data as a string and then use regular expressions (via gregexpr).
original_data <- c(1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5,6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 4, 5, 6, 1, 2, 4, 5, 6, 10, 1, 2, 4, 5, 7, 9, 7, 7, 2, 5, 6, 10, 1, 2, 4, 5, 6, 10, 1, 2, 4, 5, 6, 10, 1, 2, 4, 8, 1, 2, 3, 5, 7, 8, 1, 2, 3, 5, 6, 1, 2, 3, 5, 6, 1, 2, 4, 5, 6, 10)
data_as_string <- paste(original_data, collapse="-")
pattern1 = "1-2-4-5-6" # Your "pat"
pattern2 = "1-4-5-6" # Occurs 21 times in your data
pattern3 = "1-2-4-5-7-9-7-7-2-5-6-10" # Your "pat1"
gregexpr(pattern1,data_as_string)
# [[1]]
# [1] 169 207 220 273
# attr(,"match.length")
# [1] 9 9 9 9
# attr(,"useBytes")
# [1] TRUE
# So if you just want the number of occurrences
length(gregexpr(pattern1,data_as_string)[[1]])
# [1] 4
length(gregexpr(pattern2,data_as_string)[[1]])
# [1] 21
length(gregexpr(pattern3,data_as_string)[[1]])
# [1] 1
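Two caveats apply to counting with length(): gregexpr returns -1 when nothing matches, so length() still gives 1, and a plain string pattern such as "1-4-5-6" would also match inside "11-4-5-6" or "1-4-5-60". A possible guard, sketched below with a made-up helper name count_regex and assuming non-negative integer data, counts only positive match positions and uses Perl lookarounds so that matches cannot start or end in the middle of a number.

```r
# Same values as original_data above, written compactly
original_data <- c(rep(c(1, 4, 5, 6), 21),
                   1, 2, 4, 5, 6, 10,
                   1, 2, 4, 5, 7, 9, 7, 7, 2, 5, 6, 10,
                   1, 2, 4, 5, 6, 10,
                   1, 2, 4, 5, 6, 10,
                   1, 2, 4, 8,
                   1, 2, 3, 5, 7, 8,
                   1, 2, 3, 5, 6,
                   1, 2, 3, 5, 6,
                   1, 2, 4, 5, 6, 10)

# Hypothetical helper: regex-based count of `pattern` in `x`
count_regex <- function(x, pattern) {
  s <- paste(x, collapse = "-")
  # Lookarounds forbid a digit directly before or after the match,
  # so "1-4" cannot match inside "11-4" or "1-40"
  rx <- paste0("(?<![0-9])", paste(pattern, collapse = "-"), "(?![0-9])")
  m <- gregexpr(rx, s, perl = TRUE)[[1]]
  sum(m > 0)  # gregexpr returns -1 when there is no match
}

count_regex(original_data, c(1, 2, 4, 5, 6))
count_regex(original_data, c(1, 4, 5, 6))
count_regex(original_data, c(3, 4))  # pattern that does not occur
```

gregexpr reports non-overlapping matches only, so a pattern that can overlap with itself (for example 7,7 inside 7,7,7) would be undercounted by this approach.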
Upvotes: 1