Detect discrepancies between two sequences

Question

I have two time series vectors, complete_data and incomplete_data. Each vector consists of 6 possible events that occur randomly throughout it. In principle the two should be identical, because every event in complete_data was also supposed to be appended to incomplete_data. In reality there were some anomalies in the system, and not all of the events in complete_data were sent to incomplete_data, so complete_data is longer than incomplete_data. I need to find where the two patterns differ and mark those positions. My attempt assumes that the discrepancy occurs in a single chunk, whereas in reality there are various "missing events" scattered throughout incomplete_data.

Here is my attempt:

complete_data <- c('a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c')
dfcomplete <- as.data.frame(complete_data)
incomplete_data <- c('a', 'b', 'c', 'a','c', 'a', 'b', 'a', 'b', 'c')
dfincomplete <- as.data.frame(incomplete_data)

findMatch <- function(complete_data, incomplete_data){

  matching_inorder <- NULL
  matching_reverseorder <- NULL

  for (i in 1:length(complete_data)){
    matching_inorder[i] <- complete_data[i] == incomplete_data[i]
    matching_reverseorder[i] <- rev(complete_data)[i] == rev(incomplete_data)[i]
  }

  is_match <- ifelse(matching_inorder == FALSE & 
                       rev(matching_reverseorder) == FALSE, 'non_match', 'match')
  is_match
}

dfcomplete$is_match_incorrect <- findMatch(dfcomplete$complete_data,
                                 dfincomplete$incomplete_data)

And here is what I would like to get:

dfcomplete$expected_output <- c('match', 'match', 'match', 'match', 'non_match', 'match',
                 'match', 'match', 'non_match', 'match', 'match', 'match')

In reality my data is much larger than these examples, with many discrepancies scattered throughout the vector. The discrepancies are not so numerous as to make the task meaningless, though: in one case the complete vector has 320 data points while the incomplete vector has 309.

Any help that can be offered would be much appreciated.

Julius Vainora · Accepted Answer

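There are various ways to do this, but here's a recursive one, where x is assumed to be the complete sequence and y the incomplete one. Each element of x is compared with the next not-yet-matched element of y; on a mismatch, that element of x is treated as missing from y.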

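compare <- function(x, y) {
  if (length(x) > 0) {
    # the length check guards against y running out before x does
    if (length(y) > 0 && x[1] == y[1]) {
      # matched: advance in both sequences
      c("match", compare(x[-1], y[-1]))
    } else {
      # x[1] is missing from y: advance in x only
      c("no match", compare(x[-1], y))
    }
  }
}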
compare(complete_data, incomplete_data)
# [1] "match"    "match"    "match"    "match"    "no match" "match"   
# [7] "match"    "match"    "no match" "match"    "match"    "match"

Another approach, perhaps more readable, uses a simple loop:

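out <- rep(NA_character_, length(complete_data))
gap <- 0  # number of missing events found so far
for (i in seq_along(complete_data)) {
  j <- i - gap  # position in incomplete_data aligned with complete_data[i]
  if (j <= length(incomplete_data) && complete_data[i] == incomplete_data[j]) {
    out[i] <- "match"
  } else {
    out[i] <- "no match"
    gap <- gap + 1
  }
}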
out
# [1] "match"    "match"    "match"    "match"    "no match" "match"   
# [7] "match"    "match"    "no match" "match"    "match"    "match"
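
Both approaches rely on incomplete_data being complete_data with some events removed and nothing added or reordered. As a minimal sketch (the function name mark_missing is hypothetical, not part of the answer above), the loop can be wrapped in a reusable function that also warns when that assumption does not hold, that is, when some elements of the incomplete vector are never matched:

```r
# Hypothetical helper: labels each element of `complete` as "match" or
# "no match", assuming `incomplete` is `complete` with elements removed.
mark_missing <- function(complete, incomplete) {
  out <- character(length(complete))
  j <- 1  # position of the next unmatched element in `incomplete`
  for (i in seq_along(complete)) {
    if (j <= length(incomplete) && complete[i] == incomplete[j]) {
      out[i] <- "match"
      j <- j + 1
    } else {
      out[i] <- "no match"
    }
  }
  # Leftover elements mean `incomplete` was not a subsequence of
  # `complete`, so the labels should not be trusted.
  if (j <= length(incomplete)) {
    warning("incomplete has ", length(incomplete) - j + 1,
            " unmatched element(s)")
  }
  out
}

complete_data <- c('a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c')
incomplete_data <- c('a', 'b', 'c', 'a', 'c', 'a', 'b', 'a', 'b', 'c')
res <- mark_missing(complete_data, incomplete_data)
table(res)
```

When no warning is raised, the number of "no match" labels equals the difference in length between the two vectors, which gives a quick sanity check on larger data such as the 320 versus 309 case mentioned in the question.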
