Modular run-length encoding

Question

Question

How to implement a run length encoding modulus n>=1? For n=4, considering the inputAAABBBBABCCCCBBBDAAA, we want an output of [('D', 1), ('A', 3)]. Note the long-distance merging due to the modulus operation. See Explanation.

Explanation

The first occurance of BBBB encodes to (B, 4) whose modulus 4 is (B, 0), thus canceling itself out. See the diagram (ignore spaces; they are simply for illustrative purposes):

AAABBBBABCCCCBBBDAAA
A3  B4 ABCCCCBBBDAAA
A3  B0 ABCCCCBBBDAAA
A3     ABCCCCBBBDAAA
A4      BCCCCBBBDAAA
A0      BCCCCBBBDAAA
        BCCCCBBBDAAA
        ... 
                DA3

A simpler example when no merging happens since none gets canceled by modulus 4: input AAABABBBC produces output [('A',3),('B',1),('A',1),('B',3),('C',1)].

Requirements

Haskell implementations are preferred but others are welcome too!
Prefer standard/common library functions over 3rd party libraries.
Prefer readable and succint programs utilizing higher-order functions.
Prefer efficiency (do not loop over the whole list whenever unnecessary)

My program

I implemented this in Haskell, but it looks too verbose and awful to read. The key idea is to check three tuples at a time, and only advance one tuple forward if we can neither cancel out 0 tuples nor merge a pair of tuples among the three tuples at hand.

import Data.List (group)

test = [('A', 1), ('A', 2), ('B', 2), ('B', 2), ('A', 1), ('B', 1), ('C', 1), ('C', 3), ('B', 3), ('D', 1), ('A', 3)] :: [(Char, Int)]
expected = [('D', 1), ('A', 3)] :: [(Char, Int)]


reduce' :: [(Char, Int)] -> [(Char, Int)]
reduce' [           ] = []                                           -- exit
reduce' (   (_,0):xs) = reduce' xs
reduce' (x1:(_,0):xs) = reduce' (x1:xs)
reduce' (   (x,n):[]) = (x,n):[]                                     -- exit

reduce' (        (x1,n1):(x2,n2):[])              -- [previous,current,NONE]
  | x1 == x2  = reduce' ((x1, d4 (n1+n2)):[])
  | otherwise = (x1,n1):(                                         -- advance
                reduce' ((x2, d4     n2 ):[]))

reduce' ((x1,n1):(x2,n2):(x3,n3):xs)              -- [previous,current,next]
  | n3 == 0   = reduce' ((x1, d4  n1    ):(x2, d4  n2    ):xs)
  | n2 == 0   = reduce' ((x1, d4  n1    ):(x3, d4     n3 ):xs)
  | x2 == x3  = reduce' ((x1, d4  n1    ):(x2, d4 (n2+n3)):xs)
  | x1 == x2  = reduce' ((x2, d4 (n1+n2)):(x3, d4     n3 ):xs)
  | otherwise = (x1,n1):(                                         -- advance
                reduce' ((x2, d4  n2    ):(x3, d4     n3 ):xs)
                )

-- Helpers
flatten :: [(Char, Int)] -> String
flatten nested = concat $ (\(x, n) -> replicate n x) <$> nested

nest :: String -> [(Char, Int)]
nest flat = zip (head <$> xg) (d4 .length <$> xg)
  where xg = group flat

reduce = reduce' . nest
d4 = (`rem` 4)

Thoughts

My inputs are like the test variable in the snipped above. We could keep doing flatten then nest until its result doesn't change, and would definitely look simpler. But it feels it is scanning the whole list many times, while my 3-pointer implementation scans the whole list only once. Maybe we can pop an element from left and add it to a new stack while merging identical consecutive items? Or maybe use Applicative Functors? E.g. this works but not sure about its efficiency/performance: reduce = (until =<< ((==) =<<)) (nest . flatten).

amalloy · Accepted Answer

Algorithm

I think you are making this problem much harder by thinking of it in terms of character strings at all. Instead, do a preliminary pass that just does the boring RLE part. This way, a second pass is comparatively easy, because you can work in "tokens" that represent a run of a certain length, instead of having to work one character at a time.

The only data structure we need to maintain as we do the second pass through the list is a stack, and we only ever need to look at its top element. We compare each token that we're examining with the top of the stack. If they're the same, we blend them into a single token representing their concatenation; otherwise, we simply push the next token onto the stack. In either case, we reduce token sizes mod N and drop tokens with size 0.

Performance

This algorithm runs in linear time: it processes each input token exactly once, and does a constant amount of work for each token.
It cannot produce output lazily. There is no prefix of the input that is sufficient to confidently produce a prefix of the output, so we have to wait till we have consumed the entire input to produce any output. Even something that "looks bad" like ABCABCABCABCABC can eventually be cancelled out if the rest of the string is CCCBBBAAA....
The reverse at the end is a bummer, but amortized over all the tokens it is quite cheap, and in any case does not worsen our linear-time guarantee. It likewise does not change our space requirements, since we already require O(N) space to buffer the output (since as the previous note says, it's never possible to emit a partial result).

Correctness

Writing down my remark about laziness made me think of your reduce solution, which appears to produce output lazily, which I thought was impossible. The explanation, it turns out, is that your implementation is not just inelegant, as you say, but also incorrect. It produces output too soon, missing chances to cancel with later elements. The simplest test case I can find that you fail is reduce "ABABBBBAAABBBAAA" == [('A',1),('A',3)]. We can confirm that this is due to yielding results too early, by noting that take 1 $ reduce ("ABAB" ++ undefined) yields [(1, 'A')] even though elements might come later that cancel with that first A.

Minutiae

Finally note that I use a custom data type Run just to give a name to the concept; of course you can convert this to a tuple cheaply, or rewrite the function to use tuples internally if you prefer.

Implementation

import Data.List (group)

data Run a = Run Int a deriving Show

modularRLE :: Eq a => Int -> [a] -> [Run a]
modularRLE groupSize = go [] . tokenize
  where go stack [] = reverse stack
        go stack (Run n x : remainder) = case stack of
          [] -> go (blend n []) remainder
          (Run m y : prev) | x == y -> go (blend (n + m) prev) remainder
                           | otherwise -> go (blend n stack) remainder
          where blend i s = case i `mod` groupSize of
                              0 -> s
                              j -> Run j x : s
        tokenize xs = [Run (length run) x | run@(x:_) <- group xs]
λ> modularRLE 4 "AAABBBBABCCCCBBBDAAA"
[Run 1 'D',Run 3 'A']
λ> modularRLE 4 "ABABBBBAAABBBAAA"
[]