Jaume Ramon

Reputation: 61

Identifying repeated sequences of numbers in time series

I have a long time series where I need to identify and flag repeated sequences of values in R. Let's suppose I have the following vector:

a <- c(1,2,3,4,88,443,756,2,453,6,21,98,1,2,3,4,65)

Note that the sequence 1,2,3,4 appears twice: once at the beginning and again near the end. I want to identify and flag repeated sequences of n consecutive values (where n is a parameter I can set) in a long time series, so I need a method that scales well.

Thanks a bunch.

Upvotes: 5

Views: 949

Answers (2)

eamonn

Reputation: 101

If the patterns repeat exactly, this is just O(n) in the length of the series: hash each window and look for collisions.
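A minimal base-R sketch of the hashing idea above, in case it helps. The name flag_repeats is hypothetical, and the sketch assumes values can be compared through their character representation, which suits integer-valued or exactly repeated data. Each window is collapsed into a string key, and duplicated() (which hashes internally) finds the collisions.

```r
# flag_repeats is a hypothetical helper name, not part of any package.
# It flags every position covered by a length-n window that occurs more than once.
flag_repeats <- function(x, n = 4) {
  stopifnot(n >= 1, length(x) >= n)
  # embed() returns one row per window, with columns in reverse time order,
  # so the columns are flipped back to read left to right
  windows <- embed(x, n)[, n:1, drop = FALSE]
  # Collapse each window into a string key; equal windows give equal keys
  keys <- apply(windows, 1, paste, collapse = "_")
  # A window counts as repeated if its key also occurs earlier or later
  repeated <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
  # Spread the window-level flag onto every element the window covers
  flag <- logical(length(x))
  for (start in which(repeated)) {
    flag[start:(start + n - 1)] <- TRUE
  }
  flag
}

a <- c(1,2,3,4,88,443,756,2,453,6,21,98,1,2,3,4,65)
which(flag_repeats(a, 4))
# [1]  1  2  3  4 13 14 15 16
```

The result is a logical vector the same length as the input, so it can be stored as a flag column next to the series.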

If the patterns only repeat approximately (with similarity measured by Euclidean distance or correlation), this is O(n^2), but the Matrix Profile algorithms are very fast [a].

[a] http://www.cs.ucr.edu/~eamonn/MatrixProfile.html
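To make the approximate case concrete, here is a brute-force O(n^2) sketch in base R. It is not one of the fast algorithms from the linked page, only an illustration of the quantity they compute: for each window, the distance to its nearest non-overlapping neighbour. The name naive_profile is hypothetical, and plain (not z-normalised) Euclidean distance is an assumption made to keep the example short.

```r
# naive_profile is a hypothetical helper: a brute-force stand-in for the
# fast Matrix Profile algorithms, using plain Euclidean distance.
naive_profile <- function(x, m = 4) {
  stopifnot(m >= 1, length(x) >= 2 * m)
  # One row per length-m window, columns flipped into time order
  windows <- embed(x, m)[, m:1, drop = FALSE]
  # All pairwise Euclidean distances between windows (quadratic cost)
  d <- as.matrix(dist(windows))
  # Ignore trivial matches: a window against itself or an overlapping window
  d[abs(row(d) - col(d)) < m] <- Inf
  data.frame(
    start    = seq_len(nrow(windows)),        # where each window begins
    nn_start = unname(apply(d, 1, which.min)), # start of its nearest neighbour
    nn_dist  = unname(apply(d, 1, min))        # distance to that neighbour
  )
}

# The question's vector, with the second occurrence slightly perturbed
b <- c(1,2,3,4,88,443,756,2,453,6,21,98,1.1,2,2.9,4.2,65)
prof <- naive_profile(b, 4)
prof[which.min(prof$nn_dist), ]
```

Windows whose nn_dist falls below a threshold of your choosing can then be flagged as near-repeats. On short series a window may have no non-overlapping neighbour at all, in which case its nn_dist is Inf.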

Upvotes: 1

pogibas

Reputation: 28369

You can use this function:

identRptSeq <- function(x, N = 4) {
    # Create groups to split input vector in
    splits <- ceiling(seq_along(x) / N)
    # Use data.table shift to create overlapping windows
    foo <- lapply(data.table::shift(x, 0:(N-1), type = "lead"), function(x) {
                  res <- split(x, splits)
                  res[lengths(res) == N]})
    foo <- na.omit(t(as.data.frame(foo)))
    # Find duplicated windows
    foo[duplicated(foo), ]
}

# OP's input
a <- c(1,2,3,4,88,443,756,2,453,6,21,98,1,2,3,4,65)

# Duplicated sequence when N = 4
identRptSeq(a, 4)
[1] 1 2 3 4

# Duplicated sequences when N = 3
identRptSeq(a, 3)
     [,1] [,2] [,3]
X5      1    2    3
X5.1    2    3    4

PS: keep in mind that this doesn't work when N = 1 (R has other methods for that case, such as duplicated()).

Upvotes: 1
