pd441

Reputation: 2763

Detect discrepancies between two sequences

I have two time series vectors: complete_data and incomplete_data. Each vector consists of 6 possible events, which occur randomly throughout the vector. In principle the two should be identical, because every event recorded in complete_data was then also appended to incomplete_data. In reality, however, there were some anomalies in the system and not all of the events in complete_data were sent to incomplete_data, so complete_data is longer than incomplete_data. I need to find the differences in the pattern between the two and mark them. I made an attempt, but it assumes that the discrepancy between the two vectors occurs in a single chunk, whereas in reality there are various "missing events" scattered throughout incomplete_data.

Here is my attempt:

complete_data <- c('a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c')
dfcomplete <- as.data.frame(complete_data)
incomplete_data <- c('a', 'b', 'c', 'a','c', 'a', 'b', 'a', 'b', 'c')
dfincomplete <- as.data.frame(incomplete_data)

findMatch <- function(complete_data, incomplete_data){

  matching_inorder <- NULL
  matching_reverseorder <- NULL

  for (i in 1:length(complete_data)){
    matching_inorder[i] <- complete_data[i] == incomplete_data[i]
    matching_reverseorder[i] <- rev(complete_data)[i] == rev(incomplete_data)[i]
  }

  is_match <- ifelse(matching_inorder == FALSE & 
                       rev(matching_reverseorder) == FALSE, 'non_match', 'match')
  is_match
}

dfcomplete$is_match_incorrect <- findMatch(dfcomplete$complete_data,
                                 dfincomplete$incomplete_data)

And here is what I would like to get:

dfcomplete$expected_output <- c('match', 'match', 'match', 'match', 'non_match', 'match',
                 'match', 'match', 'non_match', 'match', 'match', 'match')

In reality my data is much larger than these examples, with many discrepancies scattered throughout the vector. There are not so many discrepancies as to make the task meaningless, though: in one case, for example, the complete vector has 320 data points whilst the incomplete vector has 309.

Any help that can be offered would be much appreciated.

Upvotes: 2

Views: 71

Answers (2)

Nicolas2

Reputation: 2210

If your event names can all be a single character long, here is a solution using string matching. The trick is to turn the incomplete data into a regular expression pattern with a capture group at every position where characters may have to be inserted.

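library(stringr)  # str_match, str_length, str_detect
library(dplyr)    # %>%, mutate, filter, bind_cols
library(tidyr)    # uncount

# The two events missing from incomplete_data are written in upper case here
complete_data <- c('a', 'b', 'c', 'a', 'B', 'c', 'a', 'b', 'C', 'a', 'b', 'c')
dfcomplete <- as.data.frame(complete_data,stringsAsFactors=FALSE)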
incomplete_data <- c('a', 'b', 'c', 'a','c', 'a', 'b', 'a', 'b', 'c')

y <- paste0('^(.*)',paste(incomplete_data,collapse='(.*)'),'(.*)$')
x <- paste(complete_data,collapse="")
z <- str_length(str_match(x,y)[-1])

data.frame(incomplete_data=c("",incomplete_data),stringsAsFactors=FALSE) %>%
  mutate(n=ifelse(incomplete_data=="",z,z+1)) %>%
  filter(n>0) %>%
  uncount(n) %>%
  mutate(incomplete_data=ifelse(str_detect(rownames(.),"\\."),"",incomplete_data)) %>%
  bind_cols(dfcomplete) %>%
  mutate(match=complete_data==incomplete_data)
#   incomplete_data complete_data match
#1                a             a  TRUE
#2                b             b  TRUE
#3                c             c  TRUE
#4                a             a  TRUE
#5                              B FALSE
#6                c             c  TRUE
#7                a             a  TRUE
#8                b             b  TRUE
#9                              C FALSE
#10               a             a  TRUE
#11               b             b  TRUE
#12               c             c  TRUE
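
To make the trick above easier to follow, here is a minimal base R sketch that only builds the pattern and inspects the capture groups. It uses regexec() and regmatches() instead of stringr, and the names pattern, target, groups and gap_lengths are illustrative rather than taken from the answer. It assumes single-character event names that are not regex metacharacters.

```r
complete_data <- c('a', 'b', 'c', 'a', 'B', 'c', 'a', 'b', 'C', 'a', 'b', 'c')
incomplete_data <- c('a', 'b', 'c', 'a', 'c', 'a', 'b', 'a', 'b', 'c')

# One "(.*)" slot before, between and after every incomplete event
pattern <- paste0('^(.*)', paste(incomplete_data, collapse = '(.*)'), '(.*)$')
pattern
# [1] "^(.*)a(.*)b(.*)c(.*)a(.*)c(.*)a(.*)b(.*)a(.*)b(.*)c(.*)$"

# The complete sequence collapsed into a single string
target <- paste(complete_data, collapse = '')

# Drop the full match; keep what each "(.*)" slot captured
groups <- regmatches(target, regexec(pattern, target))[[1]][-1]

# Number of extra characters of the complete sequence absorbed by each slot
gap_lengths <- nchar(groups)
gap_lengths
# [1] 0 0 0 0 1 0 0 0 1 0 0
```

A non-zero entry in gap_lengths means that the complete sequence has that many extra events in the corresponding slot: here one after the fourth incomplete event and one after the seventh.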

Upvotes: 1

Julius Vainora

Reputation: 48211

There are various ways to do this, but here's a recursive one, where x is assumed to be a complete sequence and y incomplete.

compare <- function(x, y) {
  if (length(x) > 0) {
    # length(y) > 0 avoids an NA comparison once y has been used up
    if (length(y) > 0 && x[1] == y[1]) {
      x[1] <- "match"
      c(x[1], compare(x[-1], y[-1]))
    } else {
      x[1] <- "no match"
      c(x[1], compare(x[-1], y))
    }
  }
}
compare(complete_data, incomplete_data)
# [1] "match"    "match"    "match"    "match"    "no match" "match"   
# [7] "match"    "match"    "no match" "match"    "match"    "match" 

Another one that perhaps is more readable and uses a simple loop would be

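out <- rep(NA, length(complete_data))  # one label per element of the complete sequence
gap <- 0  # number of missing events found so far
for(i in seq_along(complete_data)) {
  # isTRUE() guards against the NA produced once incomplete_data is exhausted
  if (isTRUE(complete_data[i] == incomplete_data[i - gap])) {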
    out[i] <- "match"
  } else {
    out[i] <- "no match"
    gap <- gap + 1
  }
}
out
# [1] "match"    "match"    "match"    "match"    "no match" "match"   
# [7] "match"    "match"    "no match" "match"    "match"    "match" 
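
For reuse on larger data, the same greedy idea could be wrapped in a function. The following is only a sketch: mark_missing is a hypothetical name, and it assumes that neither vector contains NA and that the incomplete sequence really is the complete one with some events dropped. It is run here on the all-lowercase vectors from the question.

```r
# Hypothetical helper: label each element of `complete` as "match" or
# "no match" by walking through `incomplete` in order (greedy alignment).
mark_missing <- function(complete, incomplete) {
  out <- character(length(complete))
  j <- 1  # index of the next element of `incomplete` still to be matched
  for (i in seq_along(complete)) {
    if (j <= length(incomplete) && complete[i] == incomplete[j]) {
      out[i] <- "match"
      j <- j + 1
    } else {
      out[i] <- "no match"
    }
  }
  # Leftover elements mean `incomplete` was not a subsequence of `complete`
  if (j <= length(incomplete)) {
    warning("incomplete sequence could not be fully aligned")
  }
  out
}

complete_data <- c('a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c')
incomplete_data <- c('a', 'b', 'c', 'a', 'c', 'a', 'b', 'a', 'b', 'c')

res <- mark_missing(complete_data, incomplete_data)

# Positions in the complete sequence that were never sent on
which(res == "no match")
# [1] 5 9

# Sanity check: the number of gaps should equal the difference in length
sum(res == "no match") == length(complete_data) - length(incomplete_data)
# [1] TRUE
```

The final check is a cheap way to confirm the alignment on real data, for example that exactly 11 events are flagged when the vectors have 320 and 309 elements.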

Upvotes: 3
