Ayush Goyal
Ayush Goyal

Reputation: 29

Shortest uncommon prefix from a set of strings

Given a string A and a set of string S. Need to find an optimum method to find a prefix of A which is not a prefix of any of the strings in S.

Example

A={apple}

S={april,apprehend,apprehension}

Output should be "appl" and not "app" since "app" is prefix of both "apple" and "apprehension" but "appl" is not.

I know the trie approach; by making a trie of set S and then traversing in the trie for string A.

But what I want to ask is can we do it without trie?

Like can we compare every pair (A,Si), Si = ith string from set S and get the largest common prefix out of them.In this case that would be "app" , so now the required ans would be "appl".

This would take 2 loops(one for iterating through S and another for comparing Si and A).

Can we improve upon this??

Please suggest an optimum approach.

Upvotes: 1

Views: 222

Answers (2)

Bernhard Barker
Bernhard Barker

Reputation: 55609

I'm not sure exactly what you had in mind, but here's one way to do it:

  • Keep a variable longest, initialised to 0.

  • Loop over all elements S[i] of S,
    setting longest = max(longest, matchingPrefixLength(S[i], A)).

  • Return the prefix from A of length longest+1.

This uses O(1) space and takes O(length(S)*average length of S[i]) time.

This is optimal (at least for the worst case) since you can't get around needing to look at every character of every element in S.

Example:

A={apple}
S={april,apprehend,apprehension}

longest = 0

The longest prefix for S[0] and A is 2
So longest = max(0,2) = 2

The longest prefix for S[1] and A is 3
So longest = max(2,3) = 3

The longest prefix for S[2] and A is 3
So longest = max(3,3) = 3

Now we return the prefix of length longest+1 = 4, i.e. "appl"

Note that there are actually 2 trie-based approaches:

  • Store only A in the trie. Iterate through the trie for each element from S to eliminate prefixes.

    This uses much less memory than the second approach (but still more than the approach above). At least assuming A isn't much, much longer than S[i], but you can optimise to stop at the longest element in S or construct the tree as we go to avoid this case.

  • Store all elements from S in the trie. Iterate through the trie with A to find the shortest non-matching prefix.

    This approach is significantly faster if you have lots of A's that you want to query for a constant set S (since you only have to set up the trie once, and do a single lookup for each A, where-as you have to create a new trie and run through each S[i] for each A for the first approach).

Upvotes: 2

Patrick87
Patrick87

Reputation: 28302

What is your input size?

Let's model your input as being of N+1 strings whose lengths are about M characters. Your total input size is about M(N+1) character, plus some proportional amount of apparatus to encode that data in a usable format (data structure overhead).

Your algorithm ...

maxlen = 0
for i = 1 to N
    for j = 1 to M
        if A[j] = S[i][j] then
            if j > maxlen then maxlen = j
            break
print A[1...maxlen]

... performs up M x N iterations of the innermost loop, reading two characters each time, for a total of 2MN characters read.

Recall our input data size was about M(N+1) also. So our question now is whether we can solve this problem, in the worst case, looking at asymptotically less than the total input (you do a little less than looking at all the input twice, or linear in the input size). The answer is no. Consider this worst case:

  • length of A is M'
  • length of all strings in S is M'
  • A differs from N-1 strings in S by the last two characters
  • A differs from 1 string in S by only the last character

Any algorithm must look at M'-1 characters of N-1 strings, plus M' characters of 1 string, to correctly determine the answer of this problem instance is A.

(M'-1)(N'-1) + N = M'N - M' - N + 1 + N = M'N - M' + 1

For N >= 2, the dominant terms in both M'(N+1) and M'N' are both M'N, meaning that for N >= 2, both the input size and the amount of that input any correct algorithm must read is O(MN). Your algorithm is O(MN). Any other algorithm cannot be asymptotically better.

Upvotes: 0

Related Questions