clear.choi
clear.choi

Reputation: 855

Pattern search algorithm

I have few letters and want to find pattern which happen often.

I have below letters:

 ---------TAAA-GAGAG----T--T-------T
 -------------T------A---------T----
 ----------AAA-GAGAG---C-----------T
 ------C------------T-A----T-----TA-
 -----AC----------------TT--------C-
 -------------------T---------------
 ---A---------------T---------------
 -------------------------T----T----
 ------C---------------------T-----T
 ----------AAA-GAGAG---C------------
 ----A--------------------T-------A-
 --------G-G-----------G-------T----
 ----A--------------------T---------
 ---------T-------------------------
 -----AC------T--------------------T
 --GA-G------------GT--------------T
 -----A--G-AAA-GAGAG-AA-------------
 -----------------------------------
 -T-------C-G-------T---TT------T--T
 TT-----------------T---------------
 -------------------T---TT-T--------
 ---A----G------------A--------T----
 -----------------------------------
 -------T--AA--G-GAG---C------------
 -T-A----------------A--------------
 ------------T------------------T---
 -----------G-----------------------
 --G-A------------------C---T---T---
 ----A---G---A-------A------T-------
 --------G-------------------T-----T
 -TG--------A---------A-T-----------
 --G--A--------GAGAG---CT-----------
 ---A-------G------------T----G---A-
 T-------G----T---------------------
 -T----C-GCAA--GAGAG-A-C-----T--TT--
 -----A----AAA-GAGAG-A-T------G---A-
 -T---------G-----------------------
 ---A---T---------------------G-----
 ---A---------T-------A---T---A---A-
 -----------------------------------
 TC--A----T----------G-------T-T--G-
 -T----CT---G-T-----T-A----T-T--T---
 -------------T-----TA------------A-
 --G----T-----------------T-T--T----
 ---A------AA--GAGAG---C-----T------
 --------------GAGAG-A-C------------
 ----------AA--GAGAG---C-G----------

As you can see there is a pattern in the middle: "GAGAG", so I can say that G+A+G+A+G is coming from same data.

I want to split out that lines and group.

Wrong Case. First line There is T+A+A+A but another line does not together with T+A+A+A , only A+A+A always together. So in this case T and A is not group.

Anybody has idea this kind of related algorithm or how to find a pattern.

I try to do with Java Programming.

Thank you.

Update -----------------------------------------

1) Is there minimal required length for pattern? - Two. and doesn't have to be continuous position.

2) Could "-" symbol be part of pattern? - No. "-" mean this position can be any character like "undefined".

3) Should pattern be positioned on exactly same index within each line? - Yes (But some line could be -)

4) what is minimal number for pattern occur( is 2 occurences enough?) - This is have not been decided yet just want to find most occurences position first and will find best occurences later.

Upvotes: 2

Views: 175

Answers (3)

mcdowella
mcdowella

Reputation: 19601

You could create a https://en.wikipedia.org/wiki/Suffix_array and https://en.wikipedia.org/wiki/LCP_array and look in the Least Common Prefix array to find long common prefixes, which mean that a substring has occurred again.

Upvotes: 2

Nick Johnson
Nick Johnson

Reputation: 167

You could use this algorithm I threw together on a line-by-line basis, then iterate throught the patterns and check if they match other lines patterns.

Pattern[] getPatternsForLine(String line){
ArrayList<Pattern> patterns=new ArrayList<Pattern>();
String temp=""; //what we will store the current pattern int
boolean inPattern=false;
for(char character:line.toCharArray()){
    if(!"-".equals(character)){
        temp=temp+character;
    }else{
        if(inPattern){
            boolean existsInLine=false;//already exists?
            for(Pattern p:patterns){
                if(p.pattern.equals(temp)){
                    p.occurences++;
                    existsInLine=true;
                    break;
                }
            }
            if(!existsInLine)
                patterns.add(new Pattern(temp));
        }
        inPattern=false;
        temp="";
    }
}
return patterns.toArray();
}

class Pattern{
    public String pattern;
    public int occurences;
    public Pattern(String pattern){
        this.pattern=pattern;
        occurences=1;
    }
}

Upvotes: 1

Dmitry
Dmitry

Reputation: 1283

I think you should use trie structure to solve this task. I'll describe algorithm to cover all substrings starting from index 0 - then same algorithm could be applied to check substring starting from other indexes.

Steps:

  1. Initialize empty trie
  2. For each line initialize set with references to trie nodes (initially every set has reference to trie root).
  3. For each line initialize array to track current index to check
  4. Pick Next line (initially first line). If all line process - go to 7)
  5. Check if current index to track is not "-" if so - for every reference in set attached to current line(mentioned above) create sub-node for current letter of this line with counter 1 (if node already exist - increase counter by 1). Capture all new/update nodes into set. Then replace set attached to line to new set and increase index for line and go back to point 4).
  6. Check if current index to track is "-". If so skip the line.
  7. Do for every line skipped as part of instruction 6:
    • Increase all subnodes for nodes in set attached to current lines.
    • Form new set of subnodes increase.
    • increase index for current line

Note: algorithm could be stopped once you process up to certain depth (say 3 symbols). Node in trie with highest value will be most common pattern. Of couse you want to take the longest one with same length.

Complexity:

O(NMA^K) where N - number of lines, M - number of letter in each line, A - alphabet size (e.g. number of different letters encountered) K - max pattern length to check.

Upvotes: 1

Related Questions