Reputation: 25
I have a data frame of unique character vectors that are all very similar to a distinct pattern, but with small deviations in each. I'm hoping to find a way to identify what the deviation is in each string. Here is what I have tried:
library(stringr)
#The strings are concatenated in my code, I separated them for easier use
KeyPattern <- c('abcd'
uniqchars <- function(x) unique(strsplit(x, "")[[1]])
KayPattern <- uniqchars(KeyPattern)
> KeyPattern
[1] "a" "b" "c" "d"
SampleString <- c('a', 'b', 'z', 'c', 'd')
str_detect(SampleString, KeyPattern)
[1] TRUE TRUE FALSE FALSE FALSE
As you can see, it recognizes the 'z' character, and correctly returns FALSE, and from there the pattern is completely off. I also considered trying:
word(string, start = 1L, end = start, sep = fixed(" "))
but this requires a pre-existing knowledge of where the deviations are (start = ..., end = ...) and it will be different in every row of the data frame.
Ultimately I want to have a data frame with one column of unique string, a column of distinct deviations (mismatches in the pattern), and it's location in the string.
Goal Sample Table:
String | Deviation from Key | Deviation start location |
---|---|---|
'a' 'b' 'c' 'z' 'd' | z | 4 |
'a' 'b' 'a' 'c' 'd' | a | 3 |
Current concatenated data frame:
1 ASGGGGSAASHLIALQLRLIGDAFDGGGGSGGGGSG
2 ASLTVDVGNVTYHFNNPITVLVFAILVALELGGTVHVHGNRIHVEG
3 ASLTVHVGDLTYHFENPQLVKLVAEIWARALNLTIEIRGNEIHVEG
4 ASNELVELVVEILYRMCVDPDQIKKILKRRGVSDEEVKRAIDKAIG
5 ASNMNMLEALQQRLQFYFGVVSRAALENNSGKARRFGRIVKQYEDAIKLYKAGKPVPYDELPVPPGFGG
6 ASNTIMLEALQQRLQFYFGVVSRAALENNSGKARRFGRIVKQYEDAIKLYKAGKPVPYDELPVPPGFGG
#CurrentKey
[1] "ASSTNMLEALQQRLQFYFGVVSRALENNSGKARRFGRIVKQYEDAIKLYKAGKPVPYDELPVPPGFGG"
Any suggestions?
Upvotes: 1
Views: 159
Reputation: 2626
Here is my approach:
First, define a recursive function:
find_deviation <- function(string, key, position = 1) {
stopifnot(is.character(string), is.character(key))
if (min(length(key), length(string)) == 0)
return(c(deviation = NA, position = NA))
if (string[1] != key[1])
return(c(deviation = string[1], position = position))
find_deviation(string[-1], key[-1], position + 1)
}
Then, use it to generate the desired result:
dplyr::bind_cols(
purrr::map_dfr(SampleString, ~ c(String = paste(.x, collapse = ","))),
purrr::map_dfr(SampleString, ~ find_deviation(.x, KeyPattern))
)
Result:
# A tibble: 2 x 3
String deviation position
<chr> <chr> <chr>
1 a,b,z,c,d z 3
2 a,b,a,c,d a 3
Data used:
KeyPattern <- c('a', 'b', 'c', 'd')
SampleString <- list(c('a', 'b', 'z', 'c', 'd'), c('a', 'b', 'a', 'c', 'd'))
Upvotes: 1
Reputation: 10375
Using aphid library and sequence alignment, the character vectors are combined into a list, the first element being the key pattern vector.
library(aphid)
KeyPattern <- c('a', 'b', 'c', 'd')
SampleString1 <- c('a', 'b', 'z', 'c', 'd')
SampleString2 <- c('a', 'b', 'c', 'z', 'd')
SampleString3 <- c('a', 'b', 'a', 'c', 'd')
sequences=list(KeyPattern,SampleString1,SampleString2,SampleString3)
do.call(rbind,
sapply(2:length(sequences),function(x){
glo=align(sequences[c(1,x)],type="global",k=1)
tmp=glo[1,]!=glo[2,]
data.frame(
"String"=paste0(sequences[[x]],collapse=" "),
"Deviation from Key"=glo[2,tmp],
"Deviation start location"=which(tmp)
)
},simplify=F)
)
String Deviation.from.Key Deviation.start.location
1 a b z c d z 3
2 a b c z d z 4
3 a b a c d a 3
Upvotes: 0
Reputation: 26218
see if this what you want?
df <- structure(list(STRINGS = list(c("a", "b", "c", "z", "d"), c("a",
"b", "a", "c", "d"))), class = "data.frame", row.names = c(NA,
-2L))
df
#> STRINGS
#> 1 a, b, c, z, d
#> 2 a, b, a, c, d
pattern <- c('a', 'b', 'c', 'd')
library(tidyverse)
df %>%
mutate(deviation = map_chr(STRINGS, ~ {x <- cumsum(.x[seq_along(pattern)] != pattern); .x[which(x >0)[1]]}),
deviation_start_loc = map_int(STRINGS, ~ {x <- cumsum(.x[seq_along(pattern)] != pattern); which(x > 0)[1]}))
#> STRINGS deviation deviation_start_loc
#> 1 a, b, c, z, d z 4
#> 2 a, b, a, c, d a 3
Created on 2021-06-21 by the reprex package (v2.0.0)
Upvotes: 2