Reputation: 855
I have few letters and want to find pattern which happen often.
I have below letters:
---------TAAA-GAGAG----T--T-------T
-------------T------A---------T----
----------AAA-GAGAG---C-----------T
------C------------T-A----T-----TA-
-----AC----------------TT--------C-
-------------------T---------------
---A---------------T---------------
-------------------------T----T----
------C---------------------T-----T
----------AAA-GAGAG---C------------
----A--------------------T-------A-
--------G-G-----------G-------T----
----A--------------------T---------
---------T-------------------------
-----AC------T--------------------T
--GA-G------------GT--------------T
-----A--G-AAA-GAGAG-AA-------------
-----------------------------------
-T-------C-G-------T---TT------T--T
TT-----------------T---------------
-------------------T---TT-T--------
---A----G------------A--------T----
-----------------------------------
-------T--AA--G-GAG---C------------
-T-A----------------A--------------
------------T------------------T---
-----------G-----------------------
--G-A------------------C---T---T---
----A---G---A-------A------T-------
--------G-------------------T-----T
-TG--------A---------A-T-----------
--G--A--------GAGAG---CT-----------
---A-------G------------T----G---A-
T-------G----T---------------------
-T----C-GCAA--GAGAG-A-C-----T--TT--
-----A----AAA-GAGAG-A-T------G---A-
-T---------G-----------------------
---A---T---------------------G-----
---A---------T-------A---T---A---A-
-----------------------------------
TC--A----T----------G-------T-T--G-
-T----CT---G-T-----T-A----T-T--T---
-------------T-----TA------------A-
--G----T-----------------T-T--T----
---A------AA--GAGAG---C-----T------
--------------GAGAG-A-C------------
----------AA--GAGAG---C-G----------
As you can see there is a pattern in the middle: "GAGAG", so I can say that G+A+G+A+G is coming from same data.
I want to split out that lines and group.
Wrong Case. First line There is T+A+A+A but another line does not together with T+A+A+A , only A+A+A always together. So in this case T and A is not group.
Anybody has idea this kind of related algorithm or how to find a pattern.
I try to do with Java Programming.
Thank you.
Update -----------------------------------------
1) Is there minimal required length for pattern? - Two. and doesn't have to be continuous position.
2) Could "-" symbol be part of pattern? - No. "-" mean this position can be any character like "undefined".
3) Should pattern be positioned on exactly same index within each line? - Yes (But some line could be -)
4) what is minimal number for pattern occur( is 2 occurences enough?) - This is have not been decided yet just want to find most occurences position first and will find best occurences later.
Upvotes: 2
Views: 175
Reputation: 19601
You could create a https://en.wikipedia.org/wiki/Suffix_array and https://en.wikipedia.org/wiki/LCP_array and look in the Least Common Prefix array to find long common prefixes, which mean that a substring has occurred again.
Upvotes: 2
Reputation: 167
You could use this algorithm I threw together on a line-by-line basis, then iterate throught the patterns and check if they match other lines patterns.
Pattern[] getPatternsForLine(String line){
ArrayList<Pattern> patterns=new ArrayList<Pattern>();
String temp=""; //what we will store the current pattern int
boolean inPattern=false;
for(char character:line.toCharArray()){
if(!"-".equals(character)){
temp=temp+character;
}else{
if(inPattern){
boolean existsInLine=false;//already exists?
for(Pattern p:patterns){
if(p.pattern.equals(temp)){
p.occurences++;
existsInLine=true;
break;
}
}
if(!existsInLine)
patterns.add(new Pattern(temp));
}
inPattern=false;
temp="";
}
}
return patterns.toArray();
}
class Pattern{
public String pattern;
public int occurences;
public Pattern(String pattern){
this.pattern=pattern;
occurences=1;
}
}
Upvotes: 1
Reputation: 1283
I think you should use trie structure to solve this task. I'll describe algorithm to cover all substrings starting from index 0 - then same algorithm could be applied to check substring starting from other indexes.
Steps:
Note: algorithm could be stopped once you process up to certain depth (say 3 symbols). Node in trie with highest value will be most common pattern. Of couse you want to take the longest one with same length.
Complexity:
O(NMA^K) where N - number of lines, M - number of letter in each line, A - alphabet size (e.g. number of different letters encountered) K - max pattern length to check.
Upvotes: 1