Reputation: 1250
Given the string patt
:
patt = "AGCTTCATGAAGCTGAGTNGGACGCGATGATGCG"
We can make a collection of shorter substrings str_col
:
str_col = substring(patt,1:(nchar(patt)-9),10:nchar(patt))
which we want to match against a subject1
:
subject1 = "AGCTTCATGAAGCTGAGTGGGACGCGATGATGCGACTAGGGACCTTAGCAGC"
treating "N" in patt
as a wildcard (match to any letter in subject1
), so all substrings in str_col
match to subject1
.
I want to do this kind of string matching in a large database of strings, and I found the Bioconductor
package Biostrings
be very efficient to do that. But, in order to be efficient, Biostrings requires you to convert your collection of substrings (here str_col
) into a dictionary of class pdict
using the function PDidc()
. You can use this 'dictionary' later in functions like countPDict()
to count matches against a target.
In order to use wildcards, you have to divide your dictionary in 3 parts: a head (left), a trusted band (middle) and a tail (right). You can only have wildcards, like "N", in the head or tail, but not in the trusted band, and you cannot have a trusted band of width = 0. So, for example, str_col[15]
won't match if you use a trusted band of minimum width = 1 like:
> PDict(str_col[1:15],tb.start=5,tb.end=5)
Error in .Call2("ACtree2_build", tb, pp_exclude, base_codes, nodebuf_ptr, :
non base DNA letter found in Trusted Band for pattern 15
because the "N" is right in the trusted band. Notice that the strings here are DNA sequences, so "N" is a code for "match to A, C, G, or T".
> PDict(str_col[1:14],tb.start=5,tb.end=5) #is OK
TB_PDict object of length 14 and width 10 (preprocessing algo="ACtree2"):
- with a head of width 4
- with a Trusted Band of width 1
- with a tail of width 5
Is there any way to circumvent this limitation of Biostrings
? I also tried to perform such task using R base functions, but I couldn't come up with anything.
Upvotes: 1
Views: 403
Reputation: 108533
I reckon that you'll need matching against some other wild cards from the IUPAC ambiguity code at one point, no?
If you need perfect matches and base functions are enough for you, you can use the same trick as the function glob2rx()
: simply use conversion with gsub()
to construct the matching patterns. An example:
IUPACtoRX <- function(x){
p <- gsub("N","\\[ATCG\\]",x)
p <- gsub("Y","\\[CT\\]",p) #match any pyrimidine
# add the ambiguity codes you want here
p
}
Obviously you need a line for every ambiguity you want to program in, but it's pretty straightforward I'd say.
Doing this, you can then eg do something like:
> sapply(str_col, function(i) grepl(IUPACtoRX(i),subject1) )
AGCTTCATGA GCTTCATGAA CTTCATGAAG TTCATGAAGC TCATGAAGCT CATGAAGCTG ATGAAGCTGA
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
TGAAGCTGAG GAAGCTGAGT AAGCTGAGTN AGCTGAGTNG GCTGAGTNGG CTGAGTNGGA TGAGTNGGAC
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
GAGTNGGACG AGTNGGACGC GTNGGACGCG TNGGACGCGA NGGACGCGAT GGACGCGATG GACGCGATGA
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
ACGCGATGAT CGCGATGATG GCGATGATGC CGATGATGCG
TRUE TRUE TRUE TRUE
To find the number of matches, you can use eg gregexpr()
:
> sapply(str_col,function(i) sum(gregexpr(IUPACtoRX(i),subject1) > 0 ))
AGCTTCATGA GCTTCATGAA CTTCATGAAG TTCATGAAGC TCATGAAGCT CATGAAGCTG ATGAAGCTGA
1 1 1 1 1 1 1
TGAAGCTGAG GAAGCTGAGT AAGCTGAGTN AGCTGAGTNG GCTGAGTNGG CTGAGTNGGA TGAGTNGGAC
1 1 1 1 1 1 1
GAGTNGGACG AGTNGGACGC GTNGGACGCG TNGGACGCGA NGGACGCGAT GGACGCGATG GACGCGATGA
1 1 1 1 1 1 1
ACGCGATGAT CGCGATGATG GCGATGATGC CGATGATGCG
1 1 1 1
Upvotes: 1