user13042372
user13042372

Reputation: 63

Count substring matches within start/stop positions of pattern in R

From a given string " GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT", I want a matrix that counts triplets in substrings that

  1. start with “AAA” or “GAA”
  2. end with “AGT” and
  3. have at least 2 and at most N other triplets between the start and the end.

For my problem n=10; So I have below code and result:

library( stringr )
#sample data
dna <- c("GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT")
#set constants
start <- c("AAA", "GAA")
end <- "AGT"

n <- 10  # << set as desired

#build regex
regex <- paste0( "(", paste0( start, collapse = "|" ), ")", paste0( "([A-Z]{3}){2,", n, "}" ), end )
str_extract_all( dna, regex )

Result:

[[1]]
[1] "GAACCCACTAGTATAAAATTTGGGAGT"    "AAACCCTTTGGGAGT"   

Now how to modify the code so that it returns a matrix containing the starting position of each match along with the actual number of "triplet(every 3 characters will be considered as a triplet)" between the start and the end for each match.

Like for the above results, the result should be:

 1  7       
 31 3         

here 1 is the starting position for GAACCCACTAGTATAAAATTTGGGAGT and 7 is the number of triplet between starting pattern GAA and ending pattern AGT. Same for AAACCCTTTGGGAGT" 31 is the starting position of AAA and 7 is the number of triplets between AAA and AGT

Upvotes: 1

Views: 73

Answers (2)

r2evans
r2evans

Reputation: 160447

Here's a function that might give you what I think you need, though it doesn't provide the matrix you said you wanted.

func <- function(dna, n = 10, starts, stops) {
  if (length(n) == 1L) n <- c(2L, n)

  startptn <- paste0("(", paste(starts, collapse = "|"), ")")
  stopptn <- paste0("(", paste(stops, collapse = "|"), ")")

  starts_ind <- gregexpr(startptn, dna)
  stops_ind <- gregexpr(stopptn, dna)
  # stops_ind ends on the first char of the triplet, so add 2
  stops_ind <- lapply(stops_ind, `+`, 2L)

  candidates <- Map(function(bgn, end, txt) {
    mtx <- outer(bgn, end, FUN = function(b, e) substring(txt, b, e))
    vec <- mtx[nzchar(mtx)]
    vec
  }, starts_ind, stops_ind, dna) 

  # "6L" is the first/last triplet defined in starts/stops
  cand_triplets <- lapply(candidates, function(z) nchar(z) - 6L)

  lens <- lengths(candidates)
  df <- data.frame( id = rep(seq_along(dna), lens), dna = rep(dna, lengths(candidates)) )
  df$match <- unlist(candidates)
  df$inner <- substring(df$match, 4, nchar(df$match) - 3)
  df$ntriplets <- nchar(df$inner) / 3

  if (nrow(df) > 0) {
    df <- df[ abs(df$ntriplets %% 1) < 1e-5 &
                n[1] <= df$ntriplets &
                df$ntriplets <= n[2], , drop = FALSE ]
  }
  df
}

Demo:

dna <- c("GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT")

func(c(dna, dna), 10, c("AAA","GAA"), "AGT")
#    id                                           dna                       match                 inner ntriplets
# 1   1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT                GAACCCACTAGT                CCCACT         2
# 2   1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGT CCCACTAGTATAAAATTTGGG         7
# 6   1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT             AAACCCTTTGGGAGT             CCCTTTGGG         3
# 7   2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT                GAACCCACTAGT                CCCACT         2
# 8   2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGT CCCACTAGTATAAAATTTGGG         7
# 12  2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT             AAACCCTTTGGGAGT             CCCTTTGGG         3

func(c(dna, dna), 20, c("AAA","GAA"), "AGT")
#    id                                           dna                                         match                                   inner ntriplets
# 1   1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT                                  GAACCCACTAGT                                  CCCACT         2
# 2   1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT                   GAACCCACTAGTATAAAATTTGGGAGT                   CCCACTAGTATAAAATTTGGG         7
# 4   1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT CCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGG        13
# 6   1 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT                               AAACCCTTTGGGAGT                               CCCTTTGGG         3
# 7   2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT                                  GAACCCACTAGT                                  CCCACT         2
# 8   2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT                   GAACCCACTAGTATAAAATTTGGGAGT                   CCCACTAGTATAAAATTTGGG         7
# 10  2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT CCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGG        13
# 12  2 GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT                               AAACCCTTTGGGAGT                               CCCTTTGGG         3

In the input:

  • each input dna can produce 0 or more rows in the output; I use c(dna, dna) merely to demonstrate that this works on multiple dna strings, vectorized;
  • if n is length 1, then it defaults to ranging between 2 and n; if length 2, then both are used appropriately;

In the output:

  • id numbers each of the dna input strings;
  • dna is the actual string; I include both id and dna in case a string is accidentally repeated (minor);
  • match is the matching substring, including the start and stop triplets;
  • inner is the same substring with the start/stop triplets removed;
  • ntriplets is really just the nchar of $inner;
  • the code filters to ensure we have triplets (/3 reduces it, and %% 1 should be 0), and then how many triplets we have based on the input n

(If you want to see all matches, set n to Inf, though it'll still filter out non-triplet inner strings.)


As to your request of a matrix of lengths, if we insert a browser() in the Map and form mtx, we'll see that we are using matrices:

bgn
# [1]  1 15 31
# attr(,"match.length")
# [1] 3 3 3
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
end
# [1] 12 27 45
# attr(,"match.length")
# [1] 3 3 3
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
txt
# [1] "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"

mtx <- outer(bgn, end, FUN = function(b, e) substring(txt, b, e))
mtx
#      [,1]           [,2]                          [,3]                                           
# [1,] "GAACCCACTAGT" "GAACCCACTAGTATAAAATTTGGGAGT" "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
# [2,] ""             "AAAATTTGGGAGT"               "AAAATTTGGGAGTCCCAAACCCTTTGGGAGT"              
# [3,] ""             ""                            "AAACCCTTTGGGAGT"                              
nchar(mtx) - 6
#      [,1] [,2] [,3]
# [1,]    6   21   39
# [2,]   -6    7   25
# [3,]   -6   -6    9

(The negatives are just an artifact of the debugging environment and my naïve subtraction of 6 with empty strings present; this does not reflect a bug in the function.)

To me, this matrix suggests we have 2, 7, and 13 triplets in the top row.

Upvotes: 1

Lars
Lars

Reputation: 1

i have an ugly solution. You just need to analysis your pattern.

library( stringr )
#sample data
dna <- c("GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT")
#set constants
start <- c("AAA", "GAA")
end <- "AGT"

n <- 10  # << set as desired

#build regex
regex <- paste0( "(", paste0( start, collapse = "|" ), ")", paste0( "([A-Z]{3}){2,", n, "}" ), end )
triplet_string <- str_extract_all( dna, regex )

triplet_matrix <- str_locate_all(dna, regex)

# first solution:
# differenz between end and start position (-1) divide by 3 for triplet and substract the first and last triplet 
triplet_length_1 <- (triplet_matrix[[1]][,2] - (triplet_matrix[[1]][,1]-1))/3 - 2 
df <- data.frame(startpos = triplet_matrix[[1]][,1],
                 lengthtriplet = triplet_length_1)

# second solution:
# getting the length of each ORF and substract 2 for start and stopp codon

triplet_length_2 <- nchar(triplet_string[[1]])/3 - 2
df <- data.frame(startpos = triplet_matrix[[1]][,1],
                 lengthtriplet = triplet_length_2)

I hope that i could help you.

Upvotes: 0

Related Questions