Taylor Beauvais
Taylor Beauvais

Reputation: 25

Extracting a single unique character from a pattern in R

I have a data frame of unique character vectors that are all very similar to a distinct pattern, but with small deviations in each. I'm hoping to find a way to identify what the deviation is in each string. Here is what I have tried:

library(stringr)

#The strings are concatenated in my code, I separated them for easier use

KeyPattern <- c('abcd'
uniqchars <- function(x) unique(strsplit(x, "")[[1]]) 
KayPattern <- uniqchars(KeyPattern)

> KeyPattern
[1] "a" "b" "c" "d"

SampleString <- c('a', 'b', 'z', 'c', 'd')



str_detect(SampleString, KeyPattern)

[1] TRUE TRUE FALSE FALSE FALSE

As you can see, it recognizes the 'z' character, and correctly returns FALSE, and from there the pattern is completely off. I also considered trying:

word(string, start = 1L, end = start, sep = fixed(" "))

but this requires a pre-existing knowledge of where the deviations are (start = ..., end = ...) and it will be different in every row of the data frame.

Ultimately I want to have a data frame with one column of unique string, a column of distinct deviations (mismatches in the pattern), and it's location in the string.

Goal Sample Table:

String Deviation from Key Deviation start location
'a' 'b' 'c' 'z' 'd' z 4
'a' 'b' 'a' 'c' 'd' a 3

Current concatenated data frame:


1                                  ASGGGGSAASHLIALQLRLIGDAFDGGGGSGGGGSG
2                        ASLTVDVGNVTYHFNNPITVLVFAILVALELGGTVHVHGNRIHVEG
3                        ASLTVHVGDLTYHFENPQLVKLVAEIWARALNLTIEIRGNEIHVEG
4                        ASNELVELVVEILYRMCVDPDQIKKILKRRGVSDEEVKRAIDKAIG
5 ASNMNMLEALQQRLQFYFGVVSRAALENNSGKARRFGRIVKQYEDAIKLYKAGKPVPYDELPVPPGFGG
6 ASNTIMLEALQQRLQFYFGVVSRAALENNSGKARRFGRIVKQYEDAIKLYKAGKPVPYDELPVPPGFGG

#CurrentKey
[1] "ASSTNMLEALQQRLQFYFGVVSRALENNSGKARRFGRIVKQYEDAIKLYKAGKPVPYDELPVPPGFGG"

Any suggestions?

Upvotes: 1

Views: 159

Answers (3)

ktiu
ktiu

Reputation: 2626

Here is my approach:

First, define a recursive function:

find_deviation <- function(string, key, position = 1) {
  stopifnot(is.character(string), is.character(key))
  if (min(length(key), length(string)) == 0)
    return(c(deviation = NA, position = NA))
  if (string[1] != key[1])
    return(c(deviation = string[1], position = position))
  find_deviation(string[-1], key[-1], position + 1)
}

Then, use it to generate the desired result:

dplyr::bind_cols(
  purrr::map_dfr(SampleString, ~ c(String = paste(.x, collapse = ","))),
  purrr::map_dfr(SampleString, ~ find_deviation(.x, KeyPattern))
)

Result:

# A tibble: 2 x 3
  String    deviation position
  <chr>     <chr>     <chr>   
1 a,b,z,c,d z         3       
2 a,b,a,c,d a         3       

Data used:

KeyPattern <- c('a', 'b', 'c', 'd')
SampleString <- list(c('a', 'b', 'z', 'c', 'd'), c('a', 'b', 'a', 'c', 'd'))

Upvotes: 1

user2974951
user2974951

Reputation: 10375

Using aphid library and sequence alignment, the character vectors are combined into a list, the first element being the key pattern vector.

library(aphid)

KeyPattern <- c('a', 'b', 'c', 'd')
SampleString1 <- c('a', 'b', 'z', 'c', 'd')
SampleString2 <- c('a', 'b', 'c', 'z', 'd')
SampleString3 <- c('a', 'b', 'a', 'c', 'd')

sequences=list(KeyPattern,SampleString1,SampleString2,SampleString3)

do.call(rbind,
  sapply(2:length(sequences),function(x){
    glo=align(sequences[c(1,x)],type="global",k=1)
    tmp=glo[1,]!=glo[2,]
    data.frame(
      "String"=paste0(sequences[[x]],collapse=" "),
      "Deviation from Key"=glo[2,tmp],
      "Deviation start location"=which(tmp)
    )
  },simplify=F)
)

     String Deviation.from.Key Deviation.start.location
1 a b z c d                  z                        3
2 a b c z d                  z                        4
3 a b a c d                  a                        3

Upvotes: 0

AnilGoyal
AnilGoyal

Reputation: 26218

see if this what you want?

df <- structure(list(STRINGS = list(c("a", "b", "c", "z", "d"), c("a", 
                                                                  "b", "a", "c", "d"))), class = "data.frame", row.names = c(NA, 
                                                                                                                             -2L))

df
#>         STRINGS
#> 1 a, b, c, z, d
#> 2 a, b, a, c, d

pattern <- c('a', 'b', 'c', 'd')

library(tidyverse)

df %>%
  mutate(deviation = map_chr(STRINGS, ~  {x <- cumsum(.x[seq_along(pattern)] != pattern); .x[which(x >0)[1]]}),
         deviation_start_loc = map_int(STRINGS, ~  {x <- cumsum(.x[seq_along(pattern)] != pattern); which(x > 0)[1]}))

#>         STRINGS deviation deviation_start_loc
#> 1 a, b, c, z, d         z                   4
#> 2 a, b, a, c, d         a                   3

Created on 2021-06-21 by the reprex package (v2.0.0)

Upvotes: 2

Related Questions