Reputation: 1340
I am trying to solve this algorithmic problem:
https://dunjudge.me/analysis/problems/469/
For convenience, I have summarized the problem statement below.
Given an array of length (<= 2,000,000) containing integers in the range [0, 1,000,000], find the longest subarray that contains a majority element.
A majority element is defined as an element that occurs > floor(n/2) times in a list of length n.
Time limit: 1.5s
For example:
If the given array is [1, 2, 1, 2, 3, 2],
The answer is 5 because the subarray [2, 1, 2, 3, 2] of length 5 from position 1 to 5 (0-indexed) has the number 2 which appears 3 > floor(5/2) times. Note that we cannot take the entire array because 3 = floor(6/2).
The first thing that comes to mind is an obvious brute force (but correct) solution which fixes the start and end indexes of a subarray and loop through it to check if it contains a majority element. Then we take the length of the longest subarray that contains a majority element. This works in O(n^2) with a small optimization. Clearly, this will not pass the time limit.
I was also thinking of dividing the elements into buckets that contain their indexes in sorted order.
Using the example above, these buckets would be:
1: 0, 2
2: 1, 3, 5
3: 4
Then for each bucket, I would make an attempt to merge the indexes together to find the longest subarray that contains k as the majority element where k is the integer label of that bucket. We could then take the maximum length over all values of k. I didn't try out this solution as I didn't know how to perform the merging step.
Edit:
I solved this problem thanks to the answers of PhamTrung and hk6279. Although I accepted the answer from PhamTrung because he first suggested the idea, I highly recommend looking at the answer by hk6279 because his answer elaborates the idea of PhamTrung and is much more detailed (and also comes with a nice formal proof!).
Upvotes: 18
Views: 1872
Reputation: 23955
For completeness, here's an outline of an O(n)
theory. Consider the following, where *
are characters different from c
:
* c * * c * * c c c
i: 0 1 2 3 4 5 6 7 8 9
A plot for adding 1
for c
and subtracting 1
for a character other than c
could look like:
sum_sequence
0 c c
-1 * * c c
-2 * * c
-3 *
A plot for the minimum of the above sum sequence, seen for c
, could look like:
min_sum
0 c * *
-1 * c * *
-2 c c c
Clearly, for each occurrence of c
, we are looking for the leftmost occurrence of c
with sum_sequence
lower than or equal to the current sum_sequence
. A non-negative difference would mean c
is a majority, and leftmost guarantees the interval is the longest up to our position. (We can extrapolate a maximal length that is bounded by characters other than c
from the inner bounds of c
as the former can be flexible without affecting the majority.)
Observe that from one occurrence of c
to the next, its sum_sequence
can decrease by an arbitrary size. However, it can only ever increase by 1
between two consecutive occurrences of c
. Rather than each value of min_sum
for c
, we can record linear segments, marked by c
s occurrences. A visual example:
[start_min
\
\
\
\
end_min, start_min
\
\
end_min]
We iterate over occurrences of c
and maintain a pointer to the optimal segment of min_sum
. Clearly we can derive the next sum_sequence
value for c
from the previous one since it is exactly diminished by the number of characters in between.
An increase in sum_sequence
for c
corresponds with a shift of 1
back or no change in the pointer to the optimal min_sum
segment. If there is no change in the pointer, we hash the current sum_sequence
value as a key to the current pointer value. There can be O(num_occurrences_of_c)
such hash keys.
With an arbitrary decrease in c
's sum_sequence
value, either (1) sum_sequence
is lower than the lowest min_sum
segment recorded so we add a new, lower segment and update the pointer, or (2) we've seen this exact sum_sequence
value before (since all increases are by 1
only) and can use our hash to retrieve the optimal min_sum
segment in O(1)
.
As Matt Timmermans pointed out in the question comments, if we were just to continually update the pointer to the optimal min_sum
by iterating over the list, we would still only perform O(1)
amortized-time iterations per character occurrence. We see that for each increasing segment of sum_sequence
, we can update the pointer in O(1)
. If we used binary search only for the descents, we would add at most (log k)
iterations for every k
occurences (this assumes we jump down all the way), which keeps our overall time at O(n)
.
Upvotes: 0
Reputation: 1879
This elaborate and explain how attempt 2 in @PhamTrung solution is working
To get the length of longest subarray. We should
m
m
-1, length of given array)Concept
The attempt is stem from a method to solve longest positive subarray
We maintain an array of counter for each unique number x
. We do a +1
when we encounter x
. Otherwise, do a -1
.
Take array [0,1,2,0,0,3,4,5,0,0,1,0] and unique number 0
, we have array counter [1,0,-1,0,1,0,-1,-2,-1,0,-1,0]. If we blind those are not target unique number, we get [1,#,#,0,1,#,#,#,-1,0,#,0].
We can get valid array from the blinded counter array when there exist two counter such that the value of the right counter is greater than or equal to the left one. See Proof part.
To further improve it, we can ignore all # as they are useless and we get [1(0),0(3),1(4),-1(8),0(9),0(11)] in count(index) format.
We can further improve this by not record counter that is greater than its previous effective counter. Take counter of index 8,9 as an example, if you can form subarray with index 9, then you must be able to form subarray with index 8. So, we only need [1(0),0(3),-1(8)] for computation.
You can form valid subarray with current index with all previous index using binary search on counter array by looking for closest value that is less than or equal to current counter value (if found)
Proof
When right counter greater than left counter by r
for a particular x, where k,r >=0 , there must be k+r number of x
and k number of non x
exist after left counter. Thus
x
Procedure
m
= 1
- Create a new counter array [1(pi)]
- Create a new index record storing current index value (pi) and counter value (1)
- Calculate current counter value ci by cprev+2-(pi - pprev), where cprev,pprev are counter value and index value in index record
- Perform binary search to find the longest subarray that can be formed with current index position and all previous index position. i.e. Find the closest c, cclosest, in counter array where c<=ci. If not found, jump to step 5
Calculate number of
x
in the subarray found in step 2r = ci - cclosest
k = (pi-pclosest-r)/2
number of
x
= k+r+1- Update counter
m
by number ofx
if subarray has number ofx
> m- Update counter array by append current counter if counter value less than last recorded counter value
- Update index record by current index (pi) and counter value (ci)
Upvotes: 2
Reputation: 11284
Note: attempt 1 is wrong as @hk6279 has given a counter example. Thanks for pointing it out.
Attempt 1: The answer is quite complex, so I will discuss a brief idea
Let process each unique number one by one.
Processing each occurrence of number x
from left to right, at index i
, let add an segment (i, i)
indicates the start and end of the current subarray. After that, we need to look to the left side of this segment, and try to merge the left neighbour of this segment into (i, i)
, (So, if the left is (st, ed)
, we try to make it become (st, i)
if it satisfy the condition) if possible, and continue to merge them until we are not able to merge, or there is no left neighbour.
We keep all those segments in a stack for faster look up/add/remove.
Finally, for each segment, we try to enlarge them as large as possible, and keep the biggest result.
Time complexity should be O(n)
as each element could only be merged once.
Attempt 2:
Let process each unique number one by one
For each unique number x
, we maintain an array of counter. From 0 to end of the array, if we encounter a value x
we increase the count, and if we don't we decrease, so for this array
[0,1,2,0,0,3,4,5,0,0] and number 0
, we have this array counter
[1,0,-1,0,1,0,-1,-2,-1,0]
So, in order to make a valid subarray which ends at a specific index i
, the value of counter[i] - counter[start - 1]
must be greater than 0 (This can be easily explained if you view the array as making from 1 and -1 entries; with 1 is when there is an occurrence of x, -1 otherwise
; and the problem can be converted into finding the subarray with sum is positive)
So, with the help of a binary search, the above algo still have an complexity of O(n ^ 2 log n) (in case we have n/2 unique numbers, we need to do the above process n/2 times, each time take O (n log n))
To improve it, we make an observation that, we actually don't need to store all values for all counter, but just the values of counter of x
, we saw that we can store for above array counter:
[1,#,#,0,1,#,#,#,-1,0]
This will leads to O (n log n) solution, which only go through each element once.
Upvotes: 5
Reputation: 3640
Algorithm : Essentially, what Boyer-Moore does is look for a suffix sufsuf of nums where suf[0]suf[0] is the majority element in that suffix. To do this, we maintain a count, which is incremented whenever we see an instance of our current candidate for majority element and decremented whenever we see anything else. Whenever count equals 0, we effectively forget about everything in nums up to the current index and consider the current number as the candidate for majority element. It is not immediately obvious why we can get away with forgetting prefixes of nums - consider the following examples (pipes are inserted to separate runs of nonzero count).
[7, 7, 5, 7, 5, 1 | 5, 7 | 5, 5, 7, 7 | 7, 7, 7, 7]
Here, the 7 at index 0 is selected to be the first candidate for majority element. count will eventually reach 0 after index 5 is processed, so the 5 at index 6 will be the next candidate. In this case, 7 is the true majority element, so by disregarding this prefix, we are ignoring an equal number of majority and minority elements - therefore, 7 will still be the majority element in the suffix formed by throwing away the first prefix.
[7, 7, 5, 7, 5, 1 | 5, 7 | 5, 5, 7, 7 | 5, 5, 5, 5]
Now, the majority element is 5 (we changed the last run of the array from 7s to 5s), but our first candidate is still 7. In this case, our candidate is not the true majority element, but we still cannot discard more majority elements than minority elements (this would imply that count could reach -1 before we reassign candidate, which is obviously false).
Therefore, given that it is impossible (in both cases) to discard more majority elements than minority elements, we are safe in discarding the prefix and attempting to recursively solve the majority element problem for the suffix. Eventually, a suffix will be found for which count does not hit 0, and the majority element of that suffix will necessarily be the same as the majority element of the overall array.
Here's Java Solution :
Space complexity : O(1)
public int majorityElement(int[] nums) {
int count = 0;
Integer candidate = null;
for (int num : nums) {
if (count == 0) {
candidate = num;
}
count += (num == candidate) ? 1 : -1;
}
return candidate;
}
Upvotes: -1