Reputation: 2234

clustering strings - what algorithm is suitable?

I have some strings and characters will not be repeated in a single string. for example: "AABC" is not possible.

I want to cluster them into sets by their common sub-strings. for example: "ABC, CDF, GHP" will be cluster into two sets {ABC,CDF},{GHP}.

several strings with one or more common sub-strings will be in one set. a string which has no common sub-string with any other strings will be a set itself. so keep the number of sets smallest.

for example: 1. "ABC, AHD,AKJ,LAN,WER" will be two sets {ABC, AHD,AKJ,LAN},{WER}. 2. "ABC,BDF, HLK, YHT,PX" will be 3 sets {ABC,BDF}.{HLK, YHT},{PX}.

Finding a string which has nothing common with others is easy I think;

for(i=0; i< strings.num; i++)
{  str1 = strings[i];
     bool m_com=false;
     for(j=0;j < strings.num; j++ )
     {
      str2=strings[j];
      if(hascommon(str1,str2))
         m_com=true;
     }
   if(!m_com)
   {
     str1 has no common substring with any string,
   }
}

now I am thinking about others, how to classify them, is there any algorithm suitable for this?

Input: strings (characters are not be repeated)

output: sets (keep number of sets as small as possible)

I know this involves with finding common sub-string problem and clustering. but I am not familiar with clustering techniques, so I am hoping some one could recommend me such algorithm.

while I am looking for good ways to do this, I also appreciate suggestions from others.

Tip: actually these strings are simple paths between two points in a graph. I want to find the edge whose removal cuts all these paths. the number of such edges should be minimum. so, for AB,BC,CD, it means a single path ABCD exist. and I write down a algorithm to find common substrings in my case(my case much simpler). I think I might use this algorithm during the clustering to measure similarities.

I might have two paths, {ABC, ADC}, both removing A or removing B could split the paths. or I could have {ABC, ADC,HG}, so removing {A,H}, or {CH}, or {CG},or {AG} all works.

I thought I could solve this by finding common subs-strings, then I decide where to remove edges.

Upvotes: 0

Answers (3)

David Frank

Reputation: 6092

As opposed to WhatsUp, I assume you want any two strings in a subset to have a common substring. This means that for AB, BC, CD, {AB, BC, CD} is not a valid solution, because AB and CD do not have a common substring.

As Whatsup already pointed out, you can represent your strings as a graph, where vertices are the strings and and edge goes from one to the other if they have a common character.

If we are not accepting chains (as described at the beginning), the problem becomes finding a minimum clique cover, which is unfortunately NP-complete.

Upvotes: 0

WhatsUp

Reputation: 1636

One thing should be pointed out first:

For any two strings, "having common substring" is really equivalent to "having common letter". Thus we can replace the condition by "having common letter".

Consider the graph G whose vertices are the strings, and two strings are connected by an edge if and only if they have a common letter. Then you are really asking for separate the graph G into connected components. This can be done easily, using standard graph operation algorithms, c.f. the wiki page here.

What remains is the task of establishing the graph. This is also easy: first, create 26 boxes, labelled A to Z, and read each string once. If the string contains letter A, then put it (or its index) into box A, etc. Finally, those strings inside one box have edges connecting to each other.

There can be further optimizations, but I guess it will depend on the nature of your input data.

Upvotes: 1

Thanga

Reputation: 8161

You have to use Heap's algorithm for your job to create permutations https://en.wikipedia.org/wiki/Heap's_algorithm

Upvotes: 0

clustering strings - what algorithm is suitable?

Answers (3)

Related Questions