Anisotropic
Anisotropic

Reputation: 645

Remove consecutive duplicates in a string to make the smallest string

Given a string and the constraint of matching on >= 3 characters, how can you ensure that the result string will be as small as possible?

edit with gassa's explicitness:

E.G. 'AAAABBBAC'

If I remove the B's first, AAAA[BBB]AC -- > AAAAAC, then I can remove all of the A's from the resultant string and be left with:

[AAAAA]C --> C

'C'

If I just remove what is available first (the sequence of A's), I get:

[AAAA]BBBAC -- > [BBB]AC --> AC

'AC'

Upvotes: 14

Views: 4175

Answers (5)

javidcf
javidcf

Reputation: 59691

Here is a Python solution (function reduce_min), not particularly smart but I think fairly easy to understand (excessive amount of comments added for answer clarity):

def reductions(s, min_len):
    """
    Yields every possible reduction of s by eliminating contiguous blocks
    of l or more repeated characters.
    For example, reductions('AAABBCCCCBAAC', 3) yields
    'BBCCCCBAAC' and 'AAABBBAAC'.
    """
    # Current character
    curr = ''
    # Length of current block
    n = 0
    # Start position of current block
    idx = 0
    # For each character
    for i, c in enumerate(s):
        if c != curr:
            # New block begins
            if n >= min_len:
                # If previous block was long enough
                # yield reduced string without it
                yield s[:idx] + s[i:]
            # Start new block
            curr = c
            n = 1
            idx = i
        else:
            # Still in the same block
            n += 1
    # Yield reduction without last block if it was long enough
    if n >= min_len:
        yield s[:idx]

def reduce_min(s, min_len):
    """
    Finds the smallest possible reduction of s by successive
    elimination of contiguous blocks of min_len or more repeated
    characters.
    """
    # Current set of possible reductions
    rs = set([s])
    # Current best solution
    result = s
    # While there are strings to reduce
    while rs:
        # Get one element
        r = rs.pop()
        # Find reductions
        r_red = list(reductions(r, min_len))
        # If no reductions are found it is irreducible
        if len(r_red) == 0 and len(r) < len(result):
            # Replace if shorter than current best
            result = r
        else:
            # Save reductions for next iterations
            rs.update(r_red)
    return result

assert reduce_min("BAABCCCBBA", 3) == "B"
assert reduce_min("AABBAAAC", 3) == "AABBC"
assert reduce_min("AAAA", 3) == ""
assert reduce_min("AAAABBBAC", 3) == "C"

EDIT: Since people seem to be posting C++ solutions, here is mine in C++ (again, function reduce_min):

#include <string>
#include <vector>
#include <unordered_set>
#include <iterator>
#include <utility>
#include <cassert>

using namespace std;

void reductions(const string &s, unsigned int min_len, vector<string> &rs)
{
    char curr = '\0';
    unsigned int n = 0;
    unsigned int idx = 0;
    for (auto it = s.begin(); it != s.end(); ++it)
    {
        if (curr != *it)
        {
            auto i = distance(s.begin(), it);
            if (n >= min_len)
            {
                rs.push_back(s.substr(0, idx) + s.substr(i));
            }
            curr = *it;
            n = 1;
            idx = i;
        }
        else
        {
            n += 1;
        }
    }
    if (n >= min_len)
    {
        rs.push_back(s.substr(0, idx));
    }
}

string reduce_min(const string &s, unsigned int min_len)
{
    unordered_set<string> rs { s };
    string result = s;
    vector<string> rs_new;
    while (!rs.empty())
    {
        auto it = rs.begin();
        auto r = *it;
        rs.erase(it);
        rs_new.clear();
        reductions(r, min_len, rs_new);
        if (rs_new.empty() && r.size() < result.size())
        {
            result = move(r);
        }
        else
        {
            rs.insert(rs_new.begin(), rs_new.end());
        }
    }
    return result;
}

int main(int argc, char **argv)
{
    assert(reduce_min("BAABCCCBBA", 3) == "B");
    assert(reduce_min("AABBAAAC", 3) == "AABBC");
    assert(reduce_min("AAAA", 3) == "");
    assert(reduce_min("AAAABBBAC", 3) == "C");
    return 0;
}

If you can use C++17 you can save memory by using string views.


EDIT 2: About the complexity of the algorithm. It is not straightforward to figure out, and as I said the algorithm is meant to be simple more than anything, but let's see. In the end, it is more or less the same as a breadth-first search. Let's say the string length is n, and, for generality, let's say the minimum block length (value 3 in the question) is m. In the first level, we can generate up to n / m reductions in the worst case. For each of these, we can generate up to (n - m) / m reductions, and so on. So basically, at "level" i (loop iteration i) we create up to (n - i * m) / m reductions per string we had, and each of these will take O(n - i * m) time to process. The maximum number of levels we can have is, again, n / m. So the complexity of the algorithm (if I'm not making mistakes) should have the form:

O( sum {i = 0 .. n / m} ( O(n - i * m) * prod {j = 0 .. i} ((n - i * m) / m) ))
       |-Outer iters--|   |---Cost---|        |-Prev lvl-| |---Branching---|

Whew. So this should be something like:

O( sum {i = 0 .. n / m} (n - i * m) * O(n^i / m^i) )

Which in turn would collapse to:

O((n / m)^(n / m))

So yeah, the algorithm is more or less simple, but it can run into exponential cost cases (the bad cases would be strings made entirely of exactly m-long blocks, like AAABBBCCCAAACCC... for m = 3).

Upvotes: 1

גלעד ברקן
גלעד ברקן

Reputation: 23955

Clearly we are not concerned about any block of repeated characters longer than 2 characters. And there is only one way two blocks of the same character where at least one of the blocks is less than 3 in length can be combined - namely, if the sequence between them can be removed.

So (1) look at pairs of blocks of the same character where at least one is less than 3 in length, and (2) determine if the sequence between them can be removed.

We want to decide which pairs to join so as to minimize the total length of blocks less than 3 characters long. (Note that the number of pairs is bound by the size (and distribution) of the alphabet.)

Let f(b) represent the minimal total length of same-character blocks remaining up to the block b that are less than 3 characters in length. Then:

f(b):
  p1 <- previous block of the same character

  if b and p1 can combine:
    if b.length + p1.length > 2:
      f(b) = min(
        // don't combine
        (0 if b.length > 2 else b.length) +
          f(block before b),
        // combine
        f(block before p1)
      )

    // b.length + p1.length < 3
    else:
      p2 <- block previous to p1 of the same character

      if p1 and p2 can combine:
        f(b) = min(
          // don't combine
          b.length + f(block before b),
          // combine
          f(block before p2)
        )
      else:
        f(b) = b.length + f(block before b)

  // b and p1 cannot combine
  else:
    f(b) = b.length + f(block before b)

  for all p1 before b

The question is how can we efficiently determine if a block can be combined with the previous block of the same character (aside from the obvious recursion into the sub-block-list between the two blocks).

Python code:

import random
import time

def parse(length):
  return length if length < 3 else 0

def f(string):
  chars = {}
  blocks = [[string[0], 1, 0]]

  chars[string[0]] = {'indexes': [0]}
  chars[string[0]][0] = {'prev': -1}

  p = 0 # pointer to current block

  for i in xrange(1, len(string)):
    if blocks[len(blocks) - 1][0] == string[i]:
      blocks[len(blocks) - 1][1] += 1
    else:
      p += 1
      # [char, length, index, f(i), temp] 
      blocks.append([string[i], 1, p])

      if string[i] in chars:
        chars[string[i]][p] = {'prev': chars[string[i]]['indexes'][ len(chars[string[i]]['indexes']) - 1 ]}
        chars[string[i]]['indexes'].append(p)
      else:
        chars[string[i]] = {'indexes': [p]}
        chars[string[i]][p] = {'prev': -1}

  #print blocks
  #print
  #print chars
  #print

  memo = [[None for j in xrange(len(blocks))] for i in xrange(len(blocks))]

  def g(l, r, top_level=False):
    ####
    ####
    #print "(l, r): (%s, %s)" % (l,r)

    if l == r:
      return parse(blocks[l][1])
    if memo[l][r]:
      return memo[l][r]

    result = [parse(blocks[l][1])] + [None for k in xrange(r - l)]

    if l < r:
      for i in xrange(l + 1, r + 1):
        result[i - l] = parse(blocks[i][1]) + result[i - l - 1]

    for i in xrange(l, r + 1):
      ####
      ####
      #print "\ni: %s" % i

      [char, length, index] = blocks[i]
      #p1 <- previous block of the same character
      p1_idx = chars[char][index]['prev']

      ####
      ####
      #print "(p1_idx, l, p1_idx >= l): (%s, %s, %s)" % (p1_idx, l, p1_idx >= l)

      if p1_idx < l and index > l:
        result[index - l] = parse(length) + result[index - l - 1]

      while p1_idx >= l:
        p1 = blocks[p1_idx]

        ####
        ####
        #print "(b, p1, p1_idx, l): (%s, %s, %s, %s)\n" % (blocks[i], p1, p1_idx, l)

        between = g(p1[2] + 1, index - 1)

        ####
        ####
        #print "between: %s" % between

        #if b and p1 can combine:
        if between == 0:
          if length + p1[1] > 2:
            result[index - l] = min(
              result[index - l],
              # don't combine
              parse(length) + (result[index - l - 1] if index - l > 0 else 0),
              # combine: f(block before p1)
              result[p1[2] - l - 1] if p1[2] > l else 0
            )

          # b.length + p1.length < 3
          else:
            #p2 <- block previous to p1 of the same character
            p2_idx = chars[char][p1[2]]['prev']

            if p2_idx < l:
              p1_idx = chars[char][p1_idx]['prev']
              continue

            between2 = g(p2_idx + 1, p1[2] - 1)
            #if p1 and p2 can combine:
            if between2 == 0:
              result[index - l] = min(
                result[index - l],
                # don't combine
                parse(length) + (result[index - l - 1] if index - l > 0 else 0),
                # combine the block, p1 and p2
                result[p2_idx - l - 1] if p2_idx - l > 0 else 0
              )
            else:
              #f(b) = b.length + f(block before b)
              result[index - l] = min(
                result[index - l],
                parse(length) + (result[index - l - 1] if index - l > 0 else 0)
              )

        # b and p1 cannot combine
        else:
          #f(b) = b.length + f(block before b)
          result[index - l] = min(
            result[index - l],
            parse(length) + (result[index - l - 1] if index - l > 0 else 0)
          )

        p1_idx = chars[char][p1_idx]['prev']

    #print l,r,result
    memo[l][r] = result[r - l]

    """if top_level:
      return (result, blocks)
    else:"""
    return result[r - l]

  if len(blocks) == 1:
    return ([parse(blocks[0][1])], blocks)
  else:
    return g(0, len(blocks) - 1, True)

"""s = ""

for i in xrange(300):
  s = s + ['A','B','C'][random.randint(0,2)]"""

print f("abcccbcccbacccab") # b
print
print f("AAAABBBAC");      # C
print
print f("CAAAABBBA");      # C
print
print f("AABBAAAC");      # AABBC
print
print f("BAABCCCBBA");    # B
print
print f("aaaa")
print

The string answers for these longer examples were computed using jdehesa's answer:

t0 = time.time()
print f("BCBCCBCCBCABBACCBABAABBBABBBACCBBBAABBACBCCCACABBCAABACBBBBCCCBBAACBAABACCBBCBBAABCCCCCAABBBBACBBAAACACCBCCBBBCCCCCCCACBABACCABBCBBBBBCBABABBACCAACBCBBAACBBBBBCCBABACBBABABAAABCCBBBAACBCACBAABAAAABABB")
# BCBCCBCCBCABBACCBABCCAABBACBACABBCAABACAACBAABACCBBCBBCACCBACBABACCABBCCBABABBACCAACBCBBAABABACBBABABBCCAACBCACBAABBABB
t1 = time.time()
total = t1-t0
print total

t0 = time.time()
print f("CBBACAAAAABBBBCAABBCBAABBBCBCBCACACBAABCBACBBABCABACCCCBACBCBBCBACBBACCCBAAAACACCABAACCACCBCBCABAACAABACBABACBCBAACACCBCBCCCABACABBCABBAAAAABBBBAABAABBCACACABBCBCBCACCCBABCAACBCAAAABCBCABACBABCABCBBBBABCBACABABABCCCBBCCBBCCBAAABCABBAAABBCAAABCCBAABAABCAACCCABBCAABCBCBCBBAACCBBBACBBBCABAABCABABABABCA")
# CBBACCAABBCBAACBCBCACACBAABCBACBBABCABABACBCBBCBACBBABCACCABAACCACCBCBCABAACAABACBABACBCBAACACCBCBABACABBCBBCACACABBCBCBCABABCAACBCBCBCABACBABCABCABCBACABABACCBBCCBBCACBCCBAABAABCBBCAABCBCBCBBAACCACCABAABCABABABABCA
t1 = time.time()
total = t1-t0
print total

t0 = time.time()
print f("AADBDBEBBBBCABCEBCDBBBBABABDCCBCEBABADDCABEEECCECCCADDACCEEAAACCABBECBAEDCEEBDDDBAAAECCBBCEECBAEBEEEECBEEBDACDDABEEABEEEECBABEDDABCDECDAABDAEADEECECEBCBDDAEEECCEEACCBBEACDDDDBDBCCAAECBEDAAAADBEADBAAECBDEACDEABABEBCABDCEEAABABABECDECADCEDAEEEBBBCEDECBCABDEDEBBBABABEEBDAEADBEDABCAEABCCBCCEDCBBEBCECCCA")
# AADBDBECABCEBCDABABDCCBCEBABADDCABCCEADDACCEECCABBECBAEDCEEBBECCBBCEECBAEBCBEEBDACDDABEEABCBABEDDABCDECDAABDAEADEECECEBCBDDACCEEACCBBEACBDBCCAAECBEDDBEADBAAECBDEACDEABABEBCABDCEEAABABABECDECADCEDACEDECBCABDEDEABABEEBDAEADBEDABCAEABCCBCCEDCBBEBCEA
t1 = time.time()
total = t1-t0
print total

Upvotes: 3

user unknown
user unknown

Reputation: 36229

Another scala answer, using memoization and tailcall optimization (partly) (updated).

import scala.collection.mutable.HashSet
import scala.annotation._

object StringCondense extends App {

  @tailrec
  def groupConsecutive (s: String, sofar: List[String]): List[String] = s.toList match {
  // def groupConsecutive (s: String): List[String] = s.toList match {
      case Nil => sofar
      // case Nil => Nil
      case c :: str => {
          val (prefix, rest) = (c :: str).span (_ == c)
          // Strings of equal characters, longer than 3, don't make a difference to just 3
          groupConsecutive (rest.mkString(""), (prefix.take (3)).mkString ("") :: sofar)
          // (prefix.take (3)).mkString ("") :: groupConsecutive (rest.mkString(""))
      }
  }
  // to count the effect of memoization
  var count = 0

// recursively try to eliminate every group of 3 or more, brute forcing
// but for "aabbaabbaaabbbaabb", many reductions will lead sooner or 
// later to the same result, so we try to detect these and avoid duplicate 
// work 

  def moreThan2consecutive (s: String, seenbefore: HashSet [String]): String = {
      if (seenbefore.contains (s)) s
      else
      {
        count += 1
        seenbefore += s
        val sublists = groupConsecutive (s, Nil)
        // val sublists = groupConsecutive (s)
        val atLeast3 = sublists.filter (_.size > 2)
        atLeast3.length match {
            case 0 => s
            case 1 => {
                val res = sublists.filter (_.size < 3)
                moreThan2consecutive (res.mkString (""), seenbefore)
            }
            case _ => {
                val shrinked = (
                    for {idx <- (0 until sublists.size)
                    if (sublists (idx).length >= 3)
                    pre = (sublists.take (idx)).mkString ("")
                    post= (sublists.drop (idx+1)).mkString ("")
                    } yield {
                        moreThan2consecutive (pre + post, seenbefore)
                    }
                )
                (shrinked.head /: shrinked.tail) ((a, b) => if (a.length <= b.length) a else b)
            }
        }
      }
  }

    // don't know what Rcd means, adopted from other solution but modified 
    // kind of a unit test **update**: forgot to reset count 
    testRcd (s: String, expected: String) : Boolean = {
        count = 0
        val seenbefore = HashSet [String] ()
        val result = moreThan2consecutive (s, seenbefore)
        val hit = result.equals (expected)
        println (s"Input: $s\t result: ${result}\t expected ${expected}\t $hit\t count: $count");
        hit
    }

    // some test values from other users with expected result
    // **upd:** more testcases
    def testgroup () : Unit = {
        testRcd ("baabcccbba", "b")
        testRcd ("aabbaaac", "aabbc")
        testRcd ("aaaa", "")
        testRcd ("aaaabbbac", "c")

        testRcd ("abcccbcccbacccab", "b")
        testRcd ("AAAABBBAC", "C")
        testRcd ("CAAAABBBA", "C")
        testRcd ("AABBAAAC", "AABBC")
        testRcd ("BAABCCCBBA", "B")

        testRcd ("AAABBBAAABBBAAABBBC", "C")    // 377 subcalls reported by Yola,
        testRcd ("AAABBBAAABBBAAABBBAAABBBC", "C")  // 4913 when preceeded with AAABBB
    }
    testgroup

    def testBigs () : Unit = {
    /*
      testRcd ("BCBCCBCCBCABBACCBABAABBBABBBACCBBBAABBACBCCCACABBCAABACBBBBCCCBBAACBAABACCBBCBBAABCCCCCAABBBBACBBAAACACCBCCBBBCCCCCCCACBABACCABBCBBBBBCBABABBACCAACBCBBAACBBBBBCCBABACBBABABAAABCCBBBAACBCACBAABAAAABABB",
               "BCBCCBCCBCABBACCBABCCAABBACBACABBCAABACAACBAABACCBBCBBCACCBACBABACCABBCCBABABBACCAACBCBBAABABACBBABABBCCAACBCACBAABBABB")
    */
    testRcd ("CBBACAAAAABBBBCAABBCBAABBBCBCBCACACBAABCBACBBABCABACCCCBACBCBBCBACBBACCCBAAAACACCABAACCACCBCBCABAACAABACBABACBCBAACACCBCBCCCABACABBCABBAAAAABBBBAABAABBCACACABBCBCBCACCCBABCAACBCAAAABCBCABACBABCABCBBBBABCBACABABABCCCBBCCBBCCBAAABCABBAAABBCAAABCCBAABAABCAACCCABBCAABCBCBCBBAACCBBBACBBBCABAABCABABABABCA",
             "CBBACCAABBCBAACBCBCACACBAABCBACBBABCABABACBCBBCBACBBABCACCABAACCACCBCBCABAACAABACBABACBCBAACACCBCBABACABBCBBCACACABBCBCBCABABCAACBCBCBCABACBABCABCABCBACABABACCBBCCBBCACBCCBAABAABCBBCAABCBCBCBBAACCACCABAABCABABABABCA")
    /*testRcd ("AADBDBEBBBBCABCEBCDBBBBABABDCCBCEBABADDCABEEECCECCCADDACCEEAAACCABBECBAEDCEEBDDDBAAAECCBBCEECBAEBEEEECBEEBDACDDABEEABEEEECBABEDDABCDECDAABDAEADEECECEBCBDDAEEECCEEACCBBEACDDDDBDBCCAAECBEDAAAADBEADBAAECBDEACDEABABEBCABDCEEAABABABECDECADCEDAEEEBBBCEDECBCABDEDEBBBABABEEBDAEADBEDABCAEABCCBCCEDCBBEBCECCCA",
               "AADBDBECABCEBCDABABDCCBCEBABADDCABCCEADDACCEECCABBECBAEDCEEBBECCBBCEECBAEBCBEEBDACDDABEEABCBABEDDABCDECDAABDAEADEECECEBCBDDACCEEACCBBEACBDBCCAAECBEDDBEADBAAECBDEACDEABABEBCABDCEEAABABABECDECADCEDACEDECBCABDEDEABABEEBDAEADBEDABCAEABCCBCCEDCBBEBCEA")
    */
  }

  // for generated input, but with fixed seed, to compare the count with
    // and without memoization
    import util.Random
    val r = new Random (31415)

    // generate Strings but with high chances to produce some triples and 
    // longer sequences of char clones    
    def genRandomString () : String = {
      (1 to 20).map (_ => r.nextInt (6) match {
        case 0 => "t"
        case 1 => "r"
        case 2 => "-"
        case 3 => "tt"
        case 4 => "rr"
        case 5 => "--"
      }).mkString ("")
    }

    def testRandom () : Unit = {
      (1 to 10).map (i=> testRcd (genRandomString, "random mode - false might be true"))
    }

    testRandom

  testgroup
  testRandom
  // testBigs
}

Comparing the effect of memoization lead to interesting results:

Updated measurements. In the old values, I forgot to reset the counter, which leaded to much higher results. Now the spreading of results is much more impressive and in total, the values are smaller.

No seenbefore:
Input: baabcccbba    result: b   expected b  true    count: 4
Input: aabbaaac      result: aabbc   expected aabbc  true    count: 2
Input: aaaa      result:     expected    true    count: 2
Input: aaaabbbac     result: c   expected c  true    count: 5
Input: abcccbcccbacccab  result: b   expected b  true    count: 34
Input: AAAABBBAC     result: C   expected C  true    count: 5
Input: CAAAABBBA     result: C   expected C  true    count: 5
Input: AABBAAAC      result: AABBC   expected AABBC  true    count: 2
Input: BAABCCCBBA    result: B   expected B  true    count: 4
Input: AAABBBAAABBBAAABBBC  res: C   expected C  true    count: 377
Input: AAABBBAAABBBAAABBBAAABBBC r: C    expected C  true    count: 4913
Input: r--t----ttrrrrrr--tttrtttt--rr----result: rr--rr     expected ? unknown ?     false   count: 1959
Input: ttrtt----tr---rrrtttttttrtr--rr   result: r--rr      expected ? unknown ?     false   count: 213
Input: tt----r-----ttrr----ttrr-rr--rr-- result: ttrttrrttrr-rr--rr-- ex ? unknown ?     false   count: 16
Input: --rr---rrrrrrr-r--rr-r--tt--rrrrr result: rr-r--tt-- expected ? unknown ?     false   count: 32
Input: tt-rrrrr--r--tt--rrtrrr-------    result: ttr--tt--rrt   expected ? unknown ?     false   count: 35
Input: --t-ttt-ttt--rrrrrt-rrtrttrr  result: --tt-rrtrttrr  expected ? unknown ?     false   count: 35
Input: rrt--rrrr----trrr-rttttrrtttrr    result: rrtt-      expected ? unknown ?     false   count: 1310
Input: ---tttrrrrrttrrttrr---tt-----tt   result: rrttrr     expected ? unknown ?     false   count: 1011
Input: -rrtt--rrtt---t-r--r---rttr--     result: -rrtt--rr-r--rrttr-- ex ? unknown ?     false   count: 9
Input: rtttt--rrrrrrrt-rrttt--tt--t  result: r--t-rr--tt--t  expectd ? unknown ?     false   count: 16
real    0m0.607s    (without testBigs)
user    0m1.276s
sys 0m0.056s

With seenbefore:
Input: baabcccbba    result: b   expected b  true    count: 4
Input: aabbaaac      result: aabbc   expected aabbc  true    count: 2
Input: aaaa      result:     expected    true    count: 2
Input: aaaabbbac     result: c   expected c  true    count: 5
Input: abcccbcccbacccab  result: b   expected b  true    count: 11
Input: AAAABBBAC     result: C   expected C  true    count: 5
Input: CAAAABBBA     result: C   expected C  true    count: 5
Input: AABBAAAC      result: AABBC   expected AABBC  true    count: 2
Input: BAABCCCBBA    result: B   expected B  true    count: 4
Input: AAABBBAAABBBAAABBBC rest: C   expected C  true    count: 28
Input: AAABBBAAABBBAAABBBAAABBBC C   expected C  true    count: 52
Input: r--t----ttrrrrrr--tttrtttt--rr----result: rr--rr     expected ? unknown ?     false   count: 63
Input: ttrtt----tr---rrrtttttttrtr--rr   result: r--rr      expected ? unknown ?     false   count: 48
Input: tt----r-----ttrr----ttrr-rr--rr-- result: ttrttrrttrr-rr--rr-- xpe? unknown ?     false   count: 8
Input: --rr---rrrrrrr-r--rr-r--tt--rrrrr result: rr-r--tt-- expected ? unknown ?     false   count: 19
Input: tt-rrrrr--r--tt--rrtrrr-------    result: ttr--tt--rrt   expected ? unknown ?     false   count: 12
Input: --t-ttt-ttt--rrrrrt-rrtrttrr  result: --tt-rrtrttrr  expected ? unknown ?     false   count: 16
Input: rrt--rrrr----trrr-rttttrrtttrr    result: rrtt-      expected ? unknown ?     false   count: 133
Input: ---tttrrrrrttrrttrr---tt-----tt   result: rrttrr     expected ? unknown ?     false   count: 89
Input: -rrtt--rrtt---t-r--r---rttr--     result: -rrtt--rr-r--rrttr-- ex ? unknown ?     false   count: 6
Input: rtttt--rrrrrrrt-rrttt--tt--t  result: r--t-rr--tt--t expected ? unknown ?     false   count: 8
real    0m0.474s    (without testBigs)
user    0m0.852s
sys 0m0.060s

With tailcall:
real    0m0.478s    (without testBigs)
user    0m0.860s
sys 0m0.060s

For some random strings, the difference is bigger than a 10fold.

For long Strings with many groups one could, as an improvement, eliminate all groups which are the only group of that character, for instance:

aa bbb aa ccc xx ddd aa eee aa fff xx 

The groups bbb, ccc, ddd, eee and fff are unique in the string, so they can't fit to something else and could all be eliminated, and the order of removal is will not matter. This would lead to the intermediate result

aaaa xx aaaa xx 

and a fast solution. Maybe I try to implement it too. However, I guess, it will be possible to produce random Strings, where this will have a big impact and by a different form of random generated strings, to distributions, where the impact is low.

Upvotes: 1

Bassem
Bassem

Reputation: 820

A tree would definitely get you the shortest string(s).

The tree solution:

  1. Define a State (node) for each current string Input and all its removable sub-strings' int[] Indexes.
  2. Create the tree: For each int index create another State and add it to the parent state State[] Children.
  3. A State with no possible removable sub-strings has no children Children = null.
  4. Get all Descendants State[] of your root State. Order them by their shortest string Input. And that is/are your answer(s).

Test cases:

string result = FindShortest("AAAABBBAC");      // AC
string result2 = FindShortest("AABBAAAC");      // AABBC
string result3 = FindShortest("BAABCCCBBA");    // B

The Code:

Note: Of-course everyone is welcome to enhance the following code in terms of performance and/or fixing any bug.

class Program
{
    static void Main(string[] args)
    {
        string result = FindShortest("AAAABBBAC");      // AC
        string result2 = FindShortest("AABBAAAC");      // AABBC
        string result3 = FindShortest("BAABCCCBBA");    // B
    }

    // finds the FIRST shortest string for a given input
    private static string FindShortest(string input)
    {
        // all possible removable strings' indexes
        // for this given input
        int[] indexes = RemovableIndexes(input);

        // each input string and its possible removables are a state
        var state = new State { Input = input, Indexes = indexes };

        // create the tree
        GetChildren(state);

        // get the FIRST shortest
        // i.e. there would be more than one answer sometimes
        // this could be easily changed to get all possible results
        var result = 
            Descendants(state)
            .Where(d => d.Children == null || d.Children.Length == 0)
            .OrderBy(d => d.Input.Length)
            .FirstOrDefault().Input;


        return result;
    }

    // simple get all descendants of a node/state in a tree
    private static IEnumerable<State> Descendants(State root)
    {
        var states = new Stack<State>(new[] { root });
        while (states.Any())
        {
            State node = states.Pop();
            yield return node;
            if (node.Children != null)
                foreach (var n in node.Children) states.Push(n);
        }
    }

    // creates the tree
    private static void GetChildren(State state)
    {
        // for each an index there is a child
        state.Children = state.Indexes.Select(
                i =>
                {
                    var input = RemoveAllAt(state.Input, i);
                    return input.Length < state.Input.Length && input.Length > 0
                    ? new State
                    {
                        Input = input,
                        Indexes = RemovableIndexes(input)
                    }
                    : null;
                }).ToArray();

        foreach (var c in state.Children)
            GetChildren(c);
    }

    // find all possible removable strings' indexes
    private static int[] RemovableIndexes(string input)
    {
        var indexes = new List<int>();

        char d = input[0];
        int count = 1;
        for (int i = 1; i < input.Length; i++)
        {
            if (d == input[i])
                count++;
            else
            {
                if (count >= 3)
                    indexes.Add(i - count);

                // reset
                d = input[i];
                count = 1;
            }
        }
        if (count >= 3)
            indexes.Add(input.Length - count);


        return indexes.ToArray();
    }

    // remove all duplicate chars starting from an index
    private static string RemoveAllAt(string input, int startIndex)
    {
        string part1, part2;

        int endIndex = startIndex + 1;
        int i = endIndex;
        for (; i < input.Length; i++)
            if (input[i] != input[startIndex])
            {
                endIndex = i;
                break;
            }

        if (i == input.Length && input[i - 1] == input[startIndex])
            endIndex = input.Length;

        part1 = startIndex > 0 ? input.Substring(0, startIndex) : string.Empty;
        part2 = endIndex <= (input.Length - 1) ? input.Substring(endIndex) : string.Empty;

        return part1 + part2;
    }

    // our node, which is 
    // an input string & 
    // all possible removable strings' indexes
    // & its children
    public class State
    {
        public string Input;
        public int[] Indexes;

        public State[] Children;
    }
}

Upvotes: 4

Yola
Yola

Reputation: 19023

I propose O(n^2) solution with dynamic programming.

Let's introduce notation. Prefix and suffix of length l of string A denoted by P[l] and S[l]. And we call our procedure Rcd.

  1. Rcd(A) = Rcd(Rcd(P[n-1])+S[1])
  2. Rcd(A) = Rcd(P[1]+Rcd(S[n-1]))

Note that outer Rcd in the RHS is trivial. So, that's our optimal substructure. Based on this i came up with the following implementation:

#include <iostream>
#include <string>
#include <vector>
#include <cassert>
using namespace std;

string remdupright(string s, bool allowEmpty) {
    if (s.size() >= 3) {
        auto pos = s.find_last_not_of(s.back());
        if (pos == string::npos && allowEmpty) s = "";
        else if (pos != string::npos && s.size() - pos > 3) s = s.substr(0, pos + 1);
    }
    return s;
}

string remdupleft(string s, bool allowEmpty) {
    if (s.size() >= 3) {
        auto pos = s.find_first_not_of(s.front());
        if (pos == string::npos && allowEmpty) s = "";
        else if (pos != string::npos && pos >= 3) s = s.substr(pos);
    }
    return s;
}

string remdup(string s, bool allowEmpty) {
    return remdupleft(remdupright(s, allowEmpty), allowEmpty);
}

string run(const string in) {
    vector<vector<string>> table(in.size());
    for (int i = 0; i < (int)table.size(); ++i) {
        table[i].resize(in.size() - i);
    }
    for (int i = 0; i < (int)table[0].size(); ++i) {
        table[0][i] = in.substr(i,1);
    }

    for (int len = 2; len <= (int)table.size(); ++len) {
        for (int pos = 0; pos < (int)in.size() - len + 1; ++pos) {
            string base(table[len - 2][pos]);
            const char suffix = in[pos + len - 1];
            if (base.size() && suffix != base.back()) {
                base = remdupright(base, false);
            }
            const string opt1 = base + suffix;

            base = table[len - 2][pos+1];
            const char prefix = in[pos];
            if (base.size() && prefix != base.front()) {
                base = remdupleft(base, false);
            }
            const string opt2 = prefix + base;

            const string nodupopt1 = remdup(opt1, true);
            const string nodupopt2 = remdup(opt2, true);

            table[len - 1][pos] = nodupopt1.size() > nodupopt2.size() ? opt2 : opt1;
            assert(nodupopt1.size() != nodupopt2.size() || nodupopt1 == nodupopt2);
        }
    }
    string& res = table[in.size() - 1][0];
    return remdup(res, true);
}

void testRcd(string s, string expected) {
    cout << s << " : " << run(s) << ", expected: " << expected << endl;
}

int main()
{
    testRcd("BAABCCCBBA", "B");
    testRcd("AABBAAAC", "AABBC");
    testRcd("AAAA", "");
    testRcd("AAAABBBAC", "C");
}

You can check default and run your tests here.

Upvotes: 3

Related Questions