Colonel Panic
Colonel Panic

Reputation: 137554

How to recover a number from digit subsequences?

Given the four digit number 1234, there are six possible two digit subsequences (12, 13, 14, 23, 24, 34). Given some of the subsequences, is it possible to recover the original number?

Here's some example data. Each line lists some 3 digit subsequences of a different 6 digit number (to be found)

528, 508, 028, 502, 058, 528, 028, 528, 552, 050
163, 635, 635, 130, 163, 633, 130, 330, 635, 135
445, 444, 444, 444, 454, 444, 445, 
011, 350, 601, 651, 601, 511, 511, 360, 601, 351
102, 021, 102, 221, 102, 100, 002, 021, 021, 121
332, 111, 313, 311, 132, 113, 132, 111, 112
362, 650, 230, 172, 120, 165, 372, 202, 702
103, 038, 138, 150, 110, 518, 510, 538, 108
343, 231, 431, 341, 203, 203, 401, 303, 031, 233

Edit: Sometimes the solution might not be unique (more than one number could have given the subsequences). In that case, it would be good to return one of them, or maybe even a list.

Upvotes: 3

Views: 249

Answers (3)

A. Webb
A. Webb

Reputation: 26446

Let logic programming do the work for you.

This is via core.logic in .

Define what it means to be a subsequence

(defne subseqo [s1 s2] 
  ([(h . t1) (h . t2)] (subseqo t1 t2)) 
  ([(h1 . t1) (h2 . t2)] (!= h1 h2) (subseqo s1 t2)) 
  ([() _]))

Run the constraints through the solver.

(defn recover6 [input-string] 
  (run* [q] 
    (fresh [a b c d e f] 
      (== q [a b c d e f]) 
      (everyg (fn [s] (subseqo (seq s) q)) 
              (re-seq #"\d+" input-string)))))

Examples (results are perceptually instantaneous at the REPL):

(recover6 "528, 508, 028, 502, 058, 528, 028, 528, 552, 050")
;=> ([\5 \0 \5 \2 \8 \0]
     [\5 \0 \5 \2 \0 \8]
     [\5 \0 \5 \0 \2 \8]
     [\0 \5 \0 \5 \2 \8]
     [\0 \5 \5 \0 \2 \8])

(recover6 "163, 635, 635, 130, 163, 633, 130, 330, 635, 135")
;=> ([\1 \6 \3 \5 \3 \0] 
     [\1 \6 \3 \3 \5 \0] 
     [\1 \6 \3 \3 \0 \5])

(recover6 "445, 444, 444, 444, 454, 444, 445")
;=> ([\4 \4 \5 \4 _0 _1] 
     ... and many more

In the last example, the underscores indicate that _0 and _1 are free variables. They have not been constrained. It is easy enough to constrain any free variables to the set of digits.

Upvotes: 2

Bernhard Barker
Bernhard Barker

Reputation: 55599

Build a directed graph with each digit connected to the digit following it in each sequence.

Dealing with cycles:

A cycle implies an impossible scenario - a same character cannot have 2 locations (there can be characters with the same values at multiple positions, but not the exact same character - as a metaphor, you can have many people named Bob, but any given Bob can only be at one place). Some node must be split into multiple nodes. The chosen node should be split such that all incoming edges is in one of the new nodes and all outgoing edges is in the other, and there's a connection between the two.

There should be multiple nodes which can be picked to be split, with possibly only one being correct, you may need to explore all possibilities until you find one that works. If one doesn't work, you'll get a longer string than is allowed somewhere down the line.

It might be a better idea to leave getting rid of cycles completely until right before the topological sort (resolving them in the same way).

Dealing with nodes with the same value (as a result of cycle resolution):

If there are multiple nodes with the same value that can be chosen, let the outgoing edges go from the first one (the one that has a directed path to all the others) and the incoming edges to the last one (the one which all other ones has a directed path to). This obviously needs to be slightly modified if there are multiple digits with the same value in the same sequence.

Finding the actual string:

To determine the string, do a topological sort on the graph.

Example:

Assume we're looking for a 5-digit number and the input is:

528, 508, 028, 502, 058, 058

I know the duplication of 058 is somewhat trivial, but it's just for illustration.

For 528, create nodes for 5, 2 and 8, and connect 5 and 2 and 2 and 8.

5 -> 2 -> 8

For 508, create 0, connect 5 and 0 and 0 and 8.

5 -> 2 -> 8
  \    /
   > 0

For 028, connect 0 and 2.

5 ------> 2 -> 8
  \    /    /
   > 0 -----

For 502, all connections are already there.

For 058, we get a cycle (5->0->5), so we have 2 choices:

  • Split 0 into 2 nodes:

      /-----------\----\
     /             v    v
    0 -> 5 ------> 2 -> 8
           \
            > 0
    
  • Split 5 into 2 nodes:

      /-----------\
     /             v
    5 ------> 2 -> 5 -> 8
     \        ^    ^
      \       /    /
       > 0 --------
    

Let's assume we go with the latter.

For 058, we need an outgoing edge from the last 5 (the right 5 in this case) and an incoming edge from the first 5 (the left 5 in this case). These edges (5->0 and 5->8) already exists, so there's nothing to do.

A topological sort will give us 50258, which is our number.

Upvotes: 3

Thomas Ahle
Thomas Ahle

Reputation: 31604

What you want to do is to find the Shortest common supersequence of all your subsequences. Clearly if you have all subsequences, including the original number, the SCS will be what you are looking for. Otherwise it can't be guaranteed, but there's a good chance.

Unfortunately there isn't a nice polynomial algorithm for this problem, but if you Google it you'll find there are a lot of approximation algorithms available. E.g. An ACO Algorithm for the Shortest Common Supersequene Problem which mentions there are three overall approaches:

  1. Dynamic Programming or Branch'n'Bound. These are usually to slow except for very few strings or small alphabets.

  2. Finding the SCS of the strings pairwise using Dynamic Programming, using heuristics to choose which strings to 'merge'.

  3. The Majority Merge heuristic which might be the nicest one for your case.

  4. The approach described in the paper.

Here is another nice article about the problem: http://www.update.uu.se/~shikaree/Westling/

Upvotes: 3

Related Questions