danio
danio

Reputation: 8653

How can I detect common substrings in a list of strings

Given a set of strings, for example:

EFgreen
EFgrey
EntireS1
EntireS2
J27RedP1
J27GreenP1
J27RedP2
J27GreenP2
JournalP1Black
JournalP1Blue
JournalP1Green
JournalP1Red
JournalP2Black
JournalP2Blue
JournalP2Green

I want to be able to detect that these are three sets of files:

Are there any known ways of approaching this problem - any published papers I can read on this?

The approach I am considering is for each string look at all other strings and find the common characters and where differing characters are, trying to find sets of strings that have the most in common, but I fear that this is not very efficient and may give false positives.

Note that this is not the same as 'How do I detect groups of common strings in filenames' because that assumes that a string will always have a series of digits following it.

Upvotes: 49

Views: 32067

Answers (9)

Steve Faiwiszewski
Steve Faiwiszewski

Reputation: 313

There are many solutions proposed that solve the general case of finding common substrings. However, the problem here is more specialized. You're looking for common prefixes, not just substrings. This makes it a little simpler. A nice explanation for finding longest common prefix can be found at http://www.geeksforgeeks.org/longest-common-prefix-set-1-word-by-word-matching/

So my proposed "pythonese" pseudo-code is something like this (refer to the link for an implementation of find_lcp:

def count_groups(items):
  sorted_list = sorted(items)

  prefix = sorted_list[0]
  ctr = 0
  groups = {}
  saved_common_prefix = ""
  for i in range(1, sorted_list):
    common_prefix = find_lcp(prefix, sorted_list[i])
    if len(common_prefix) > 0: #we are still in the same group of items
      ctr += 1
      saved_common_prefix = common_prefix
      prefix = common_prefix
    else: # we must have switched to a new group
      groups[saved_common_prefix] = ctr
      ctr = 0
      saved_common_prefix = ""
      prefix = sorted_list[i]
  return groups

Upvotes: 1

Ryan Flynn
Ryan Flynn

Reputation: 231

Great question! The steps to a solution are:

  1. tokenizing input
  2. using tokens to build an appropriate data structure. a DAWG is ideal, but a Trie is simpler and a decent starting point.
  3. optional post-processing of the data structure for simplification or clustering of subtrees into separate outputs
  4. serialization of the data structure to a regular expression or similar syntax

I've implemented this approach in regroup.py. Here's an example:

$ cat | ./regroup.py --cluster-prefix-len=2
EFgreen
EFgrey
EntireS1
EntireS2
J27RedP1
J27GreenP1
J27RedP2
J27GreenP2
JournalP1Black
JournalP1Blue
JournalP1Green
JournalP1Red
JournalP2Black
JournalP2Blue
JournalP2Green
^D
EFgre(en|y)
EntireS[12]
J27(Green|Red)P[12]
JournalP[12](Bl(ack|ue)|(Green|Red))

Upvotes: 13

Sebastiaan van den Broek
Sebastiaan van den Broek

Reputation: 6351

There are many many approaches to string similarity. I would suggest taking a look at this open-source library that implements a lot of metrics like Levenshtein distance.

http://sourceforge.net/projects/simmetrics/

Upvotes: 2

Vikrant Sagar
Vikrant Sagar

Reputation: 71

try "frak" . It creates regex expression from set of strings. Maybe some modification of it will help you. https://github.com/noprompt/frak

Hope it helps.

Upvotes: 3

Riyar Ajay
Riyar Ajay

Reputation: 11

import java.util.*;
class StringProblem
{
    public List<String> subString(String name)
    {
        List<String> list=new ArrayList<String>(); 
        for(int i=0;i<=name.length();i++)
        {
           for(int j=i+1;j<=name.length();j++)
           {
               String s=name.substring(i,j);
               list.add(s);
           }
        }
        return list;
    }
    public String commonString(List<String> list1,List<String> list2,List<String> list3)
    {
        list2.retainAll(list1);
        list3.retainAll(list2);

        Iterator it=list3.iterator();
        String s="";
        int length=0;
        System.out.println(list3);   // 1 1 2 3 1 2 1
        while(it.hasNext())    
        {
           if((s=it.next().toString()).length()>length)
           {
              length=s.length();
           }
        }
        return s;
    }
    public static void main(String args[])
    {
        Scanner sc=new Scanner(System.in);
        System.out.println("Enter the String1:");
        String name1=sc.nextLine();
        System.out.println("Enter the String2:");
        String name2=sc.nextLine();
        System.out.println("Enter the String3:");
        String name3=sc.nextLine();
      //  String name1="salman";
      //  String name2="manmohan";
      //  String name3="rahman";

        StringProblem  sp=new StringProblem();

        List<String> list1=new ArrayList<String>();
        list1=sp.subString(name1);

        List<String> list2=new ArrayList<String>();
        list2=sp.subString(name2);


        List<String> list3=new ArrayList<String>();
        list3=sp.subString(name3);

        sp.commonString(list1,list2,list3);
        System.out.println(" "+sp.commonString(list1,list2,list3));
    }
}

Upvotes: 0

Lucas Wiman
Lucas Wiman

Reputation: 11257

You should be able to achieve this with generalized suffix trees: look for long paths in the suffix tree which come from multiple source strings.

Upvotes: 1

Jeremy Bourque
Jeremy Bourque

Reputation: 3543

I would start here: http://en.wikipedia.org/wiki/Longest_common_substring_problem

There are links to supplemental information in the external links, including Perl implementations of the two algorithms explained in the article.

Edited to add:

Based on the discussion, I still think Longest Common Substring could be at the heart of this problem. Even in the Journal example you reference in your comment, the defining characteristic of that set is the substring 'Journal'.

I would first consider what defines a set as separate from the other sets. That gives you your partition to divide up the data, and then the problem is in measuring how much commonality exists within a set. If the defining characteristic is a common substring, then Longest Common Substring would be a logical starting point.

To automate the process of set detection, in general, you will need a pairwise measure of commonality which you can use to measure the 'difference' between all possible pairs. Then you need an algorithm to compute the partition that results in the overall lowest total difference. If the difference measure is not Longest Common Substring, that's fine, but then you need to determine what it will be. Obviously it needs to be something concrete that you can measure.

Bear in mind also that the properties of your difference measurement will bear on the algorithms that can be used to make the partition. For example, assume diff(X,Y) gives the measure of difference between X and Y. Then it would probably be useful if your measure of distance was such that diff(A,C) <= diff(A,B) + diff(B,C). And obviously diff(A,C) should be the same as diff(C,A).

In thinking about this, I also begin to wonder whether we could conceive of the 'difference' as a distance between any two strings, and, with a rigorous definition of the distance, could we then attempt some kind of cluster analysis on the input strings. Just a thought.

Upvotes: 13

Pasi Savolainen
Pasi Savolainen

Reputation: 2500

For this particular example of strings to keep it extremely simple consider using simple word/digit -separation.

A non-digit sequence apparently can begin with capital letter (Entire). After breaking all strings into groups of sequences, something like

[Entire][S][1]
[Entire][S][2]
[J][27][Red][P][1]
[J][27][Green][P][1]
[J][27][Red][P][2]
....
[Journal][P][1][Blue]
[Journal][P][1][Green]

Then start grouping by groups, you can fairly soon see that prefix "Entire" is a common for some group and that all subgroups have S as headgroup, so only variable for those is 1,2. For J27 case you can see that J27 is only leaf, but that it then branches at Red and Green.

So somekind of List<Pair<list, string>> -structure (composite pattern if I recall correctly).

Upvotes: 0

redtuna
redtuna

Reputation: 4600

Something like that might work.

  1. Build a trie that represents all your strings.

In the example you gave, there would be two edges from the root: "E" and "J". The "J" branch would then split into "Jo" and "J2".

  1. A single strand that forks, e.g. E-n-t-i-r-e-S-(forks to 1, 2) indicates a choice, so that would be EntireS[1,2]

  2. If the strand is "too short" in relation to the fork, e.g. B-A-(forks to N-A-N-A and H-A-M-A-S), we list two words ("banana, bahamas") rather than a choice ("ba[nana,hamas]"). "Too short" might be as simple as "if the part after the fork is longer than the part before", or maybe weighted by the number of words that have a given prefix.

  3. If two subtrees are "sufficiently similar" then they can be merged so that instead of a tree, you now have a general graph. For example if you have ABRed,ABBlue,ABGreen,CDRed,CDBlue,CDGreen, you may find that the subtree rooted at "AB" is the same as the subtree rooted at "CD", so you'd merge them. In your output this will look like this: [left branch, right branch][subtree], so: [AB,CD][Red,Blue,Green]. How to deal with subtrees that are close but not exactly the same? There's probably no absolute answer but someone here may have a good idea.

I'm marking this answer community wiki. Please feel free to extend it so that, together, we may have a reasonable answer to the question.

Upvotes: 3

Related Questions