vitor
vitor

Reputation: 1250

String matching with wildcards, trying Biostrings package

Given the string patt:

patt = "AGCTTCATGAAGCTGAGTNGGACGCGATGATGCG"

We can make a collection of shorter substrings str_col:

str_col = substring(patt,1:(nchar(patt)-9),10:nchar(patt))

which we want to match against a subject1:

subject1 = "AGCTTCATGAAGCTGAGTGGGACGCGATGATGCGACTAGGGACCTTAGCAGC"

treating "N" in patt as a wildcard (match to any letter in subject1), so all substrings in str_col match to subject1.

I want to do this kind of string matching in a large database of strings, and I found the Bioconductor package Biostrings be very efficient to do that. But, in order to be efficient, Biostrings requires you to convert your collection of substrings (here str_col) into a dictionary of class pdict using the function PDidc(). You can use this 'dictionary' later in functions like countPDict() to count matches against a target.

In order to use wildcards, you have to divide your dictionary in 3 parts: a head (left), a trusted band (middle) and a tail (right). You can only have wildcards, like "N", in the head or tail, but not in the trusted band, and you cannot have a trusted band of width = 0. So, for example, str_col[15] won't match if you use a trusted band of minimum width = 1 like:

> PDict(str_col[1:15],tb.start=5,tb.end=5)
Error in .Call2("ACtree2_build", tb, pp_exclude, base_codes, nodebuf_ptr,  :
    non base DNA letter found in Trusted Band for pattern 15

because the "N" is right in the trusted band. Notice that the strings here are DNA sequences, so "N" is a code for "match to A, C, G, or T".

> PDict(str_col[1:14],tb.start=5,tb.end=5) #is OK
TB_PDict object of length 14 and width 10 (preprocessing algo="ACtree2"):
    - with a head of width 4
    - with a Trusted Band of width 1
    - with a tail of width 5

Is there any way to circumvent this limitation of Biostrings? I also tried to perform such task using R base functions, but I couldn't come up with anything.

Upvotes: 1

Views: 403

Answers (1)

Joris Meys
Joris Meys

Reputation: 108533

I reckon that you'll need matching against some other wild cards from the IUPAC ambiguity code at one point, no?

If you need perfect matches and base functions are enough for you, you can use the same trick as the function glob2rx() : simply use conversion with gsub() to construct the matching patterns. An example:

IUPACtoRX <- function(x){
  p <- gsub("N","\\[ATCG\\]",x)
  p <- gsub("Y","\\[CT\\]",p)  #match any pyrimidine
  # add the ambiguity codes you want here
  p
}

Obviously you need a line for every ambiguity you want to program in, but it's pretty straightforward I'd say.

Doing this, you can then eg do something like:

> sapply(str_col, function(i) grepl(IUPACtoRX(i),subject1) )
AGCTTCATGA GCTTCATGAA CTTCATGAAG TTCATGAAGC TCATGAAGCT CATGAAGCTG ATGAAGCTGA 
      TRUE       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE 
TGAAGCTGAG GAAGCTGAGT AAGCTGAGTN AGCTGAGTNG GCTGAGTNGG CTGAGTNGGA TGAGTNGGAC 
      TRUE       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE 
GAGTNGGACG AGTNGGACGC GTNGGACGCG TNGGACGCGA NGGACGCGAT GGACGCGATG GACGCGATGA 
      TRUE       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE 
ACGCGATGAT CGCGATGATG GCGATGATGC CGATGATGCG 
      TRUE       TRUE       TRUE       TRUE 

To find the number of matches, you can use eg gregexpr():

> sapply(str_col,function(i) sum(gregexpr(IUPACtoRX(i),subject1) > 0 ))
AGCTTCATGA GCTTCATGAA CTTCATGAAG TTCATGAAGC TCATGAAGCT CATGAAGCTG ATGAAGCTGA 
         1          1          1          1          1          1          1 
TGAAGCTGAG GAAGCTGAGT AAGCTGAGTN AGCTGAGTNG GCTGAGTNGG CTGAGTNGGA TGAGTNGGAC 
         1          1          1          1          1          1          1 
GAGTNGGACG AGTNGGACGC GTNGGACGCG TNGGACGCGA NGGACGCGAT GGACGCGATG GACGCGATGA 
         1          1          1          1          1          1          1 
ACGCGATGAT CGCGATGATG GCGATGATGC CGATGATGCG 
         1          1          1          1

Upvotes: 1

Related Questions