krlmlr
krlmlr

Reputation: 25454

Find longest substring in binary string with not less ones than zeros

How to find, in a binary string, the longest substring where the balance, i.e. the difference between the number of ones and zeros, is >= 0?

Example:

01110000010 -> 6: 011100

1110000011110000111 -> 19: entire string

While this problem looks very similar to the Maximum Value Contiguous Subsequence (Maximum Contiguous Sum) problem, a dynamic programming solution doesn't seem to be obvious. In a divide-and-conquer approach, how to do the merging? Is an "efficient" algorithm possible after all? (A trivial O(n^2) algorithm will just iterate over all substrings for all possible starting points.)

This is a modified variant of Finding a substring, with some additional conditions. The difference is that in the linked question, only such substrings are allowed where balance never falls below zero (looking at the string in either forward or backward direction). In the given problem, balance is allowed to fall below zero, provided it recovers at some later stage.

Upvotes: 3

Views: 2374

Answers (5)

krlmlr
krlmlr

Reputation: 25454

Dynamic programming -- linear run time (finally!)

inspired by this blog post. Simple and efficient, one-pass online algorithm, but takes some time to explain.

Idea

The link above shows a different problem: Maximum subsequence sum. It cannot be mapped 1:1 to the given problem, here a "state" of O(n) is needed, in contrast to O(1) for the original problem. Still, the state can be updated in O(1).

Let's rephrase the problem. We are looking for the longest substring in the input where the balance, i.e. the difference between 0's and 1's, is greater than zero.

The state is similar to my other divide-and-conquer solution: We compute, for each position i and for each possible balance b the starting position s(i, b) of the longest string with balance b or greater that ends at position i. That is, the string that starts at index s(i, b) + 1 and ends at i has balance b or greater, and there is no longer such string that ends at i. We find the result by maximizing i - s(i, 0).

Algorithm

Of course, we do not keep all s(i, b) in memory, just those for the current i (which we iterate over the input). We start with s(0, b) := 0 for b <= 0 and := undefined for b > 0. For each i, we update with the following rule:

  1. If 1 is read: s(i, b) := s(i - 1, b - 1).
  2. If 0 is read: s(i, b) := s(i - 1, b + 1) if defined, s(i, 0) := i if s(i - 1, 1) undefined.

The function s (for current i) can be implemented as a pointer into an array of length 2n + 1; this pointer is moved forward or backward depending on the input. At each iteration, we note the value of s(i, 0).

How does it work?

The state function s becomes effective especially if the balance from the start to i is negative. It records the earliest start point where zero balance is reached, for all possible numbers of 1s that have not been read yet.

Why does it work?

Because the recursive definition of the state function is equivalent to its direct definition -- the starting position of the longest string with balance b or greater that ends at position i.

Why is the recursive definition correct?

Proof by induction.

Upvotes: 1

krlmlr
krlmlr

Reputation: 25454

Divide and conquer

Another classic. Should run in O(n log n), but rather difficult to implement.

Idea

The longest feasible substring is either in the left half, in the right half, or it passes over the boundary. Call the algorithm for both halves. For the boundary:

Assume problem size n. For the longest feasible substring that crosses the boundary, we are going to compute the balance of the left-half part of the substring.

Determine, for each possible balance between -n/2 and n/2, in the left half, the length of the longest string that ends at the boundary and has this (or a larger) balance. (Linear time!) Do the same for the right half and the longest string that starts at the boundary. The result is two arrays of size n + 1; we reverse one of them, add them element-wise and find the maximum. (Again, linear.)

Why does it work?

A substring with balance >= 0 that crosses the boundary can have balance < 0 in either the left or the right part, if the other part compensates this. ("Borrowing" balance.) The crucial question is how much to borrow; we iterate over all potential "balance credits" and find the best trade-off.

Why is this O(n log n)?

Because merging (looking at boundary-crossing string) takes only linear time.

Why is merging O(n)?

Exercise left to the reader.

Upvotes: 0

krlmlr
krlmlr

Reputation: 25454

Compressed quadratic runtime

We will be looking for (locally) longest substrings with balance zero, starting at the beginning. We will ignore strings of zeros. (Corner cases: All zeros -> empty string, balance never reaches zero again -> entire string.) Of these substrings with balance zero, all trailing zeros will be removed.

Denote by B a substring with balance > 0 and by Z a substring with only zeros. Each input string can be decomposed as follows (pseudo-regex notation):

B? (Z B)* Z?

Each of the Bs is a maximum feasible solution, meaning that it cannot be extended in either direction without reducing balance. However, it might be possible to collapse sequences of BZB or ZBZ if the balance is still larger than zero after collapsing.

Note that it is always possible to collapse sequences of BZBZB to a single B if the ZBZ part has balance >= 0. (Can be done in one pass in linear time.) Once all such sequences have been collapsed, the balance of each ZBZ part is below zero. Still, it is possible that there exist BZB parts with balance above zero -- even that in a BZBZB sequence with balance below zero both the leading and trailing BZB parts have balance over zero. At this point, it seems to be difficult to decide which BZB to collapse.

Still quadratic...

Anyway, with this simplified data structure one can try all Bs as starting points (possibly extending to the left if there's still balance left). Run time is still quadratic, but (in practice) with a much smaller n.

Upvotes: 0

justhalf
justhalf

Reputation: 9107

This can be answered quite easily in O(n) using "height array", representing the number of 1's relative to the number of 0's. Like my answer in the linked question.

Now, instead of focusing on the original array, we now focus on two arrays indexed by the heights, and one will contain the smallest index such height is found, and the other will contain the largest index such height is found. Since we don't want a negative index, we can shift everything up, such that the minimum height is 0.

So for the sample cases (I added two more 1's at the end to show my point):

1110000011010000011111
Array height visualization
  /\
 /  \
/    \
      \  /\/\        /
       \/    \      /
              \    /
               \  /
                \/
(lowest height = -5)
Shifted height array:
[5, 6, 7, 8, 7, 6, 5, 4, 3, 4, 5, 4, 5, 4, 3, 2, 1, 0, 1, 2, 3]
     Height:   0  1  2  3  4  5  6  7  8
first_view = [17,16,15, 8, 7, 0, 1, 2, 3]
last_view  = [17,18,19,20,21,22, 5, 4, 3]

note that we have 22 numbers and 23 distinct indices, 0-22, representing the 23 spaces between and padding the numbers

We can build the first_view and last_view array in O(n).

Now, for each height in the first_view, we only need to check every larger heights in last_view, and take the index with maximum difference from the first_view index. For example, from height 0, the maximum value of index in larger heights is 22. So the longest substring starting at index 17+1 will end at index 22.

To find the maximum index on the last_view array, you can convert it to a maximum to the right in O(n):

last_view_max = [22,22,22,22,22,22, 5, 4, 3]

And so finding answer is simply subtracting first_view from last_view_max,

first_view    = [17,16,15, 8, 7, 0, 1, 2, 3]
last_view_max = [22,22,22,22,22,22, 5, 4, 3]
result        = [ 5, 6, 7,14,15,22, 4, 2, 0]

and taking the maximum (again in O(n)), which is 22, achieved from starting index 0 to ending index 22, i.e., the whole string. =D

Proof of correctness:

Suppose that the maximum substring starts at index i, ends at index j. If the height at index i is the same as the height at index k<i, then k..j would be a longer substring still satisfying the requirement. Therefore it suffices to consider the first index of each height. Analogously for the last index.

Upvotes: 4

Zruty
Zruty

Reputation: 8667

I have a solution that requires O(n) additional memory and O(n) time.

Let's denote the 'height' of an index h(i) as

h(i) = <number of 1s in the substring 1..i> - <number of 0s in the same substring>

The problem can now be reformulated as: find i and j such as h(i) <= h(j) and j-i -> max.

Obviously, h(0) = 0, and if h(n) = 0, then the solution is the entire string.

Now let's compute the array B so that B[x] = min{i: h(i) = -x}. In other words, let B[x] be the leftmost index i at which h(i)= -x.

The array B[x] has a length of at most n, and is computed in one linear pass.

Now we can iterate over the original string and for each index i compute the length of the longest sequence with non-negative balance that ends on i as follows:

Lmax(i) = i - B[MIN{0, h(i)}]

The largest Lmax(i) across all i will give you the desired length.

I leave the proof as an exercise :) Contact me if you can't figure it out.

Also, my algorithm needs 2 passes of the original string, but you can collapse them into one.

Upvotes: 6

Related Questions