Reputation: 279

R indexing string with character blocks denoting nucleotide variants

I need to find positions in a string that contains blocks of characters, where each block should count as a single position. I am working with nucleotide sequences and need to keep track of positions within the sequence, but some positions are variant sites written as [A/T], where either an A or a T can be present depending on which sequence I care about (these are two similar DNA sequences that differ at a few positions). Each variant site therefore makes the string four characters longer than the sequence it represents.

I know I could get around this by inventing a new code in which [A/T] becomes, say, X and [T/A] becomes Y, but that would be confusing because a standard degeneracy code already exists, and the standard code does not keep track of which nucleotide comes from which strain (for me, the one before the / is from strain A and the one after the / is from strain B). I want to index this DNA sequence string somehow; I was thinking of something like this:

If I have a string like:

dna <- "ATC[A/T]G[G/C]ATTACAATCG"

I would like to get a table/data.frame:

pos nuc
1   A
2   T
3   C
4   [A/T]
5   G
6   [G/C]
... and so on

I feel like I could use strsplit somehow if I knew regex better. Can I add a condition so that it splits at every character, except that anything enclosed in square brackets is kept as a single block?

Upvotes: 6

Answers (4)

hwnd

Reputation: 70732

I'm the type of person who likes to keep things simple, so here's a short trick:

x <- 'ATC[A/T]G[G/C]ATTACAATCG'
data.frame(nuc = regmatches(x, gregexpr('\\[[^]]*]|.', x))[[1]])

#      nuc
# 1      A
# 2      T
# 3      C
# 4  [A/T]
# 5      G
# 6  [G/C]
# 7      A
# 8      T
# 9      T
# 10     A
# 11     C
# 12     A
# 13     A
# 14     T
# 15     C
# 16     G

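The regular expression above uses alternation. The left-hand side matches a whole substring enclosed in square brackets; the right-hand side is ., which matches any single character. At an opening bracket the whole block is matched, so each variant comes back as one element and everything else comes back one character at a time.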
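To get the pos column the question asks for, and to keep track of which strain each allele belongs to, the same tokenization could be extended roughly as follows. This is only a sketch: it assumes every variant is written as exactly [X/Y] with one letter per strain, and the column names strainA and strainB are made up for illustration.

```r
x <- 'ATC[A/T]G[G/C]ATTACAATCG'

# Tokenize: a bracketed block is one token, anything else is one character
nuc <- regmatches(x, gregexpr('\\[[^]]*]|.', x))[[1]]

# Assumption: variants always look like [X/Y],
# with X from strain A and Y from strain B
is_var <- grepl('^\\[./.\\]$', nuc)

df <- data.frame(
  pos     = seq_along(nuc),                          # 1-based sequence position
  nuc     = nuc,                                     # token as written
  strainA = ifelse(is_var, substr(nuc, 2, 2), nuc),  # letter before the /
  strainB = ifelse(is_var, substr(nuc, 4, 4), nuc),  # letter after the /
  stringsAsFactors = FALSE
)

head(df)
```

With this layout, paste(df$strainA, collapse = "") gives the plain strain A sequence, and df$pos[is_var] gives the positions of the variant sites.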

Upvotes: 5

Pierre L

Reputation: 28441

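library('stringr')
# Replace each [X/Y] block with a one-character placeholder, then split into characters
df <- as.data.frame(strsplit(gsub("\\[./.\\]", '_', dna), ''), stringsAsFactors = FALSE)
# Put the original blocks back where the placeholders are
df[, 1][df[, 1] == '_'] <- str_extract_all(dna, "\\[./.\\]")[[1]]
names(df) <- 'nuc'
df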
#      nuc
# 1      A
# 2      T
# 3      C
# 4  [A/T]
# 5      G
# 6  [G/C]
# 7      A
# 8      T
# 9      T
# 10     A
# 11     C
# 12     A
# 13     A
# 14     T
# 15     C
# 16     G

Upvotes: 6

rawr

Reputation: 20811

Here's another approach:

dna <- "ATC[A/T]G[G/C]ATTACAATCG"

(tmp <- gsub('(\\w)(\\w)','~\\1~\\2~', dna))
# [1] "~A~T~C[A/T]G[G/C]~A~T~~T~A~~C~A~~A~T~~C~G~"

(nuc <- Filter(nzchar, strsplit(gsub("(\\[.+?\\])","~\\1~", tmp), '~')[[1]]))
# [1] "A"     "T"     "C"     "[A/T]" "G"     "[G/C]" "A"     "T"     "T"    
# [10] "A"     "C"     "A"     "A"     "T"     "C"     "G"

data.frame(nuc)
#      nuc
# 1      A
# 2      T
# 3      C
# 4  [A/T]
# 5      G
# 6  [G/C]
# 7      A
# 8      T
# 9      T
# 10     A
# 11     C
# 12     A
# 13     A
# 14     T
# 15     C
# 16     G

Upvotes: 3

Chris Watson

Reputation: 1367

So an easy way to get everything aside from the bracketed characters:

strsplit(dna, '\\[[A-Z]/[A-Z]\\]')

[[1]]
[1] "ATC"        "G"          "ATTACAATCG"

Perhaps negating that would give you everything inside the brackets, or you could match on the same regex directly instead of splitting on it.

EDIT: Here is code that will get you what is in between brackets:

lbracket <- as.numeric(unlist(gregexpr('\\[', dna)))
rbracket <- as.numeric(unlist(gregexpr('\\]', dna)))
mapply(function(x, y) substr(dna, start=x, stop=y), lbracket, rbracket)

[1] "[A/T]" "[G/C]"

That should work.
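The two pieces above (the text between the brackets and the bracketed blocks themselves) could be interleaved to rebuild the full vector the question asks for. This is a rough sketch rather than a tested solution; it uses regmatches() in place of the substr() loop to pull out the blocks, and the variable names pieces, blocks, and nuc are only illustrative.

```r
dna <- "ATC[A/T]G[G/C]ATTACAATCG"
pattern <- '\\[[A-Z]/[A-Z]\\]'

# Text between the variant blocks, and the blocks themselves
pieces <- strsplit(dna, pattern)[[1]]
blocks <- regmatches(dna, gregexpr(pattern, dna))[[1]]

# Pad blocks with NA so that every piece has a partner
blocks <- c(blocks, rep(NA, length(pieces) - length(blocks)))

# Split each piece into single characters and append the block that follows it
nuc <- unlist(
  Map(function(p, b) c(strsplit(p, '')[[1]], b), pieces, blocks),
  use.names = FALSE
)
nuc <- nuc[!is.na(nuc)]   # drop the padding

data.frame(pos = seq_along(nuc), nuc = nuc)
```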

Upvotes: 1
