Patrick
Patrick

Reputation: 2702

Find the smallest unique substring for each string in an array

(I'm writing this in the context of JavaScript, but will accept an algorithmically correct answer in any language)

How do you find the shortest substring of each element in an array of strings where the substring is NOT contained within any of the other elements, ignoring case?

Suppose I have an input array such as:

var names = ["Anne", "Anthony", "LouAnn", "Kant", "Louise", "ark"];

The output should be something like:

var uniqueNames = ["ne", "h", "ua", "ka", "i", "r"];

For my purposes, you can safely assume that no element will be wholly contained within another element.

My Thoughts:
It seems that one could probably brute force this, along the lines of:

var names = ["Anne", "Anthony", "LouAnn", "Kant", "Louise", "ark"];
var uniqueNames = [], nameInd, windowSize, substrInd, substr, otherNameInd, foundMatch;
// For each name
for (nameInd = 0; nameInd < names.length; nameInd++)
{
    var name = names[nameInd];
    // For each possible substring length
    windowLoop:
    for (windowSize = 1; windowSize <= name.length; windowSize++)
    {
        // For each starting index of a substring
        for (substrInd = 0; substrInd <= name.length-windowSize; substrInd++)
        {
            substr = name.substring(substrInd,substrInd+windowSize).toLowerCase();
            foundMatch = false;
            // For each other name
            for (otherNameInd = 0; otherNameInd < names.length; otherNameInd++)
            {
                if (nameInd != otherNameInd && names[otherNameInd].toLowerCase().indexOf(substr) > -1)
                {
                    foundMatch = true;
                    break;
                }
            }

            if (!foundMatch)
            {
                // This substr works!
                uniqueNames[nameInd] = substr;
                break windowLoop;
            }
        }
    }
}

But I have to imagine there's a more elegant solution using tries/prefix trees, suffix arrays, or something interesting like that.

Edit: I believe this is the form the selected answer would take programmatically in JavaScript:

var names = ["Anne", "Anthony", "LouAnn", "Kant", "Louise", "ark"];
var uniqueNames = [], permutations = {}, permutation, nameInd, windowSize, substrInd, substr;

// For each name
for (nameInd = 0; nameInd < names.length; nameInd++)
{
    var name = names[nameInd];
    // For each possible substring length
    windowLoop:
    for (windowSize = 1; windowSize <= name.length; windowSize++)
    {
        // For each starting index of a substring
        for (substrInd = 0; substrInd <= name.length-windowSize; substrInd++)
        {
            substr = name.substring(substrInd,substrInd+windowSize).toLowerCase();
            permutations[substr] = (typeof permutations[substr] === "undefined")?nameInd:-1;
        }
    }
}

for (substr in permutations)
{
    permutation = permutations[substr];
    if (permutation !== -1 && ((typeof uniqueNames[permutation] === "string" && substr.length < uniqueNames[permutation].length) || typeof uniqueNames[permutation] === "undefined"))
    {
        uniqueNames[permutation] = substr;
    }
}

Upvotes: 12

Views: 16401

Answers (4)

Rohit Katta
Rohit Katta

Reputation: 1

   for(String s : strArr) { //O(n)
      //Assume the given string as shortest and override with shortest
       result.put(s, s);   
       for(int i = 0; i < s.length(); i++) { // O(m)              
          for (int j = i + 1; j <=s.length(); j++) {
               String subStr = s.substring(i, j);
               boolean isValid = true;
               for(String str2: strArr) { // O(n)
                   if(str2.equals(s)) // Same string cannot be a substring
                     continue;
                     
                    if(str2.contains(subStr)) {
                        isValid = false;
                        break;
                    }
               }

               if(isValid && subStr.length() < result.get(s).length()) 
                   result.put(s, subStr);
           }
        } 
   } 
    
   return result;

Upvotes: -1

KWillets
KWillets

Reputation: 31

If you build a generalized suffix tree, you just need to find the shallowest point at which an infix of each string branches off from the infixes of the other strings, and take the label to that branching point plus one "distinguishing" character. The kicker is that there has to be such an extra character (it could be branching off only at the metacharacter stuck on the end of each string), and the branching-off point might not lead to a leaf, it might lead to a subtree with leaves all from the same string (so internal nodes have to be considered).

For each string S find the shallowest (by parent label depth) node N that only contains leaves from S, and whose edge label contains at least one character. The path label from root to the parent of N, plus one character from the edge label leading to N, is the shortest infix of S not found in other strings.

I believe the labeling of nodes that only contain leaves from one string can be done during construction or by O(N) scan of the GST; then it's a simple matter to scan the final tree and keep a running minimum for each string. So it's all O(N).

(edit -- I can't reply to comments yet)

To clarify, every suffix in a suffix tree has a node where it branches off from the other suffixes; the goal here is to find the/a suffix for each string that branches off from suffixes of all the other strings at the minimum depth, as measured by the path label to that node. All we need is one extra character after that point to have a substring that doesn't appear in any other string.

Example:

Strings: abbc, abc

Using Ukonnen's algorithm, after the first string we have a suffix tree of just the suffixes from that string; I'll label them with [1] here:

abbc[1]
b
 bc[1]
 c[1]
c[1]

Next we insert string 2's suffixes:

ab
  bc[1]
  c[2]
b
 bc[1]
 c
  [1]
  [2]
c
 [1]
 [2]

Now we want to find the shortest string that leads to a branch with only [1]'s under it; we can do that by scanning all the [1]'s and looking at their immediate parents, which I will list here by path label, plus one character (which I will use below):

abbc:  abb
bbc: bb
bc: bc[1]
c: c[1]

Note that I've included [1] since it's the metacharacter that distinguishes otherwise identical suffixes of [1] and [2]. This is handy when identifying substrings that repeat in multiple strings, but it's not useful for our problem, since if we delete the [1] we end up with a string that occurs in [2] also, ie it's not a candidate.

Now, none of the labels on the right occurs in any other string, so we choose the shortest one not including a metacharacter, which is bb.

Similarly, the second string has these candidates:

abc: abc
bc: bc[2]
c: c[2]

Only one doesn't have a metacharacter at the end, so we have to go with abc.

My final point is that this per-string minimum-finding doesn't have to happen one-at-a-time; the GST can be scanned once to label nodes as either containing leaves from one string ([1],[2],..[n]) or "mixed", and then the per-string minimum non-shared strings (I would call these "distinguishing infixes") can be calculated in one pass as well.

Upvotes: 3

Mukesh
Mukesh

Reputation: 303

This problem can be solved in O(N*L*L*L) complexity. The approach will be using suffix tries. Each node of the trie will be also storing the prefix count which will refer to the number of times the substring formed while traversing to that node from the root have appeared in all of the suffixes inserted till now.

We will be constructing N+1 tries. The first trie will be global and we will be inserting all the suffixes of all N string into it. The next N tries will be local for each of the N strings containing corresponding suffixes.

This preprocessing step of constructing tries will be done in O(N*L*L).

Now once the tries have been constructed, for each string, we can start quering the number of times a substring ( starting from minimum length) has occured in the global trie and the trie corresponding to that string. If it is same in both then it implies that it is not included in any other strings except itself. This can be achieved in O(N*L*L*L). The complexity can be explained as N for each string, L*L for considering each substring and L for performing query in the trie.

Upvotes: 5

hamstergene
hamstergene

Reputation: 24439

Say N is number of strings and L is maximum length of string. You're doing up to N*L*L*N iterations.

I can only improve it a bit by trading one iteration for extra memory. For each possible substring length (L iterations),

  • enumerate all substrings of that length in each name (N*L), and store it among with name's index into a hashtable (1). If there is already an index for this substring, you know it won't work, then you replace index with some special value, like -1.

  • walk the hashtable, picking up substrings for which index is not -1 — that are the answers for their corresponding indexes, but only use them if that names don't already have a shorter answer from a previous iteration

The memory usage can be greatly reduced by storing reference back into existing string instead of copying substrings.

Upvotes: 2

Related Questions