Junjie
Junjie

Reputation: 479

build shortest string from a collection of its subsequence

Given a collection of sub-sequences from a string.

For example:

abc
acd
bcd

The problem is, how to determine the shortest string from these sequences?

For the above example, the shortest string is abcd.

Here sub-sequences means part of a string but not necessarily consecutive. like acd is a sub-sequence of string abcd.

Edit: This problem actually comes from Project Euler problem 79, in that problem, we have 50 subsequences, each has the length of 3. and all characters are digits.

Upvotes: 5

Views: 1829

Answers (5)

Alma Do
Alma Do

Reputation: 37365

Complexity

In common case, as mentioned above, this problem will be NP-hard. There is a way to resolve a matter with suffix structure, but also you can use directed graph to do that. I can not say for sure if that is better in some sense - but may be some advantage may be found in difficult corner cases.

Graph solution

To realize how you can use graph - you need just to build it properly. Let letters of strings be vertexes and order of letter will define edges. That means, ab will mean vertexes a and b and connection between a and b. Of course, connections are directed (i.e. if a is connected to b it doesn't mean that b is connected to a, so ab is a --> b)

After these definitions you'll be able to build all graph. For your sample it's:

Simple regular case

-simple enough. But abcd can also be represented with strings of two length as ab, ac, ad, bc, bd, cd - so I'll show graph for that too (it's just for more clarity):

Complex regular case

Then, to find your solution, you need in this graph to find a path with maximal length. Obviously, there's the place from which NP-hardness is "derived". From both cases above maximum length is 4 which will be reached if we'll start from a vertex and traverse graph till found a->b->c->d path.

Corner cases

Non-unique solution: in fact, you may face string sets which can't strictly define a superset. Example: ab, ad, bd, ac, cd. Both abcd and acbd will fit those substrings. But, actually, this is not too bad problem. Look at the graph:

Ambiguous way case

(I've chosen that with reason: it's like second graph, but without one connection, that's why result is ambiguous). The maximum path length is now 3 - but it can be reached with two paths: abd and acd. So how to restore at least one solution? Simple: since result strings will have same length (that comes from definition of way which we've found them) - we can just walk from start of first string and check symbols in second string. If they're matching, then it's a strict symbol position. But if they not - then we're free to chose any order between current symbols. So that will be:

  • [a] matches [a], so current result is a
  • [b] mismatches [c], so we can place either bc or cb. Let it be first. Then result is abc
  • [d] matches [d], so current result is abc+d, i.e. abcd

This is kind of "merge" where we are free to choose any result. However, this is a bit twisted case, because now we can't use just any found path - we should find all paths with maximum length (otherwise we'll not be able to restore full supersequence)

Next, non-existent solution: there may be cases when order of symbols in some strings can not be used to reproduce supersequence. Obviously, that means that one order is violating other order, thus, there will be two strings in which some two symbols have different order. Again, simplest case: abc, cbd. On the graph:

Non-existent solution

-so imminent consequence will be graph loop - it may not be so simple (like in graph above) - but it always be if order is broken. Thus, all you need to realize that is to find graph cycle. In fact. this isn't a thing that you must do separately. You'll just add cycle detection in graph longest path search.

Third: repeated symbols: this is the most tricky case. If string contains repeated symbols, then graph creation is more complicated, but still can be done. For example, let we have three conditions: aac, abc, bbc. Solution for this is aabbc. How to build graph? Now we can't just add links (because of loops at least). But I suggest following way:

  • Let we process all our strings, assigning indexes to symbols. Index is reset per string. If symbol appears only once, index will be 1. For our strings that means: a1a2c1, a1b1c1 and b1b2c1. After that, store maximum index for each symbol we've found. For sample above that will be 2 for a, 2 for b and 1 for c
  • If two connected indexed symbols have same original symbol, then connection is done "as is". For example, a1a2 will produce only one connection from a1 to a2
  • If two connected indexed symbols have different original symbols, then any first indexed symbol may be connected to any second indexed symbol. For example, b1c1 will result in (b1 to c1) connection and (b2 to c1) connection. How do we know about how many indexed symbols may be? That we've done on first step (i.e. found maximum indexes)

Thus, graph for sample above will look like:

          +------------+
          |            |
          |            |
 +------>[a2]------+   |
 |        |        |   |
 |        V        V   |
[a1]---->[c1]<----[b2] |
 |        ^        ^   |
 |        |        |   |
 +------>[b1]------+   |
          ^            |
          |            |
          +------------+

-so we'll have maximum length 5 for paths a1a2b2b1c1 and a1a2b1b2c1. We can chose any of them, ignoring indexes in result string aabbc.

Upvotes: 3

Vikram Bhat
Vikram Bhat

Reputation: 6246

Construct the graph out of the subsequences and then find the topological sort of the graph in O(E) using DFS to get the desired string of shortest length which has all subsequence in it. But as you would have noticed the topo-sort is invalid if the graph has cycles in that cases there are repetitions need for characters in the cycles which is difficult to solve.

Conclusion:- You get lucky if there are no cycles in graph and solve it in O(E) else get unlucky and end up doing brute force.

Upvotes: 1

notbad
notbad

Reputation: 2887

Maybe there is no general efficient algorithm this kind of problem. But this is a solution for this particular problemprojecteuler 79. You want to observe the input file. You will find 7 only appears at the beginning of all sequences, so 7 should be put at the beginning. Then you will find digit 3 is the only character at the beginning, then you put 3 on the second position. So on and so forth, you can get 73162890. The special case is the last 2 digits, both 0 and 9 are at the beginning, then you have 2 choices. then you try both and get 90 is the optimal solution.

More generally, you can use the depth-first-search algorithm. The trick is when you find there is one character which only appears at the beginning, just choose it, it will path to the optimal solution.

Upvotes: 2

Aseem Goyal
Aseem Goyal

Reputation: 2723

I think we can start with graphs .
I am not sure of correctness , but say if we build a graph with a->b if b comes after a , with all paths of length 1.
Now , we have to find longest distance (DFS can help) .
I will try to provide example .

say strings are :

abc
acd
bcd
bce

we form a graph :
enter image description here
Now main thing left will be to combine nodes e and d because the required string can be abcde or abced.

This is what i am not sure how to do , so maybe somebody can help !

Sorry for posting this as answer , comments can't include pictures.

Upvotes: 1

Peng Zhang
Peng Zhang

Reputation: 3585

This problem is well studied and coined "Shortest common supersequence"

For two strings, it is in O(n). Suppose n is the maximum length of string. O(n) is used to build the suffix array and then O(n) to solve the Longest Common Subsequence problem.

For more than two strings, it is NP-Complete.

Upvotes: 5

Related Questions