Reputation: 4429

Data structure that supports range based most frequently occuring element query

I'm looking for a data structure with which I can find the most frequently occuring number (among an array of numbers) in a given, variable range.

Let's consider the following 1 based array:

1 2 3 1 1 3 3 3 3 1 1 1 1

If I query the range (1,4), the data structure must retun 1, which occurs twice. Several other examples:

(1,13) = 1

(4,9) = 3

(2,2) = 2

(1,3) = 1 (all of 1,2,3 occur once, so return the first/smallest one. not so important at the moment)

I have searched, but could not find anything similar. I'm looking (ideally) a data structure with minimal space requirement, fast preprocessing, and/or query complexities.

Thanks in advance!

Upvotes: 5

Answers (2)

mikera

Reputation: 106351

You could create a binary partition tree where each node represents a histogram map of {value -> frequency} for a given range, and has two child nodes which represent the upper half and lower half of the range.

Querying is then just a case of recursively adding together a small number of these histograms to cover the range required, and scanning the resulting histogram once to find the highest occurrence count.

Useful optimizations include:

Using a histogram with mutable frequency counts as an "accumulator" while you add histograms together
Stop using precomputed histograms once you get down to a certain size (maybe a range less than the total number of possible values M) and just counting the numbers directly. It's a time/space trade-off that I think will pay off a lot of the time.
If you have a fixed small number of possible values, use an array rather than a map to store the frequency counts at each node

UPDATE: my thinking on algorithmic complexity assuming a bounded small number of possible values M and a total of N values in the complete range:

Preprocessing is O(N log N) - basically you need to traverse the complete list and build a binary tree, building one node for every M elements in order to amortise the overhead of each node
Querying is O(M log N) - basically adding up O(log N) histograms each of size M, plus counting O(M) values on either side of the range
Space requirement is O(N) - approx. 2N/M histograms each of size M. The 2 factor is the sum from having N/M histograms at the bottom level, 0.5N/M histograms at the next level, 0.25N/M at the third level etc...

Upvotes: 0

Loïc Février

Reputation: 7750

Let N be the size of the array and M the number of different values in that array.

I'm considering two complexities : pre-processing and querying an interval of size n, each must be spacial and temporal.

Solution 1 :

Spacial : O(1) and O(M)
Temporal : O(1) and O(n + M)

No pre-processing, we look at all values of the interval and find the most frequent one.

Solution 2 :

Spacial : O(M*N) and O(1)
Temporal : O(M*N) and O(min(n,M))

For each position of the array, we have an accumulative array that gives us for each value x, how many times x is in the array before that position.

Given an interval we just need for each x to subtract 2 values to find the number of x in that interval. We iterate over each x and find the maximum value. If n < M we iterate over each value of the interval, otherwise we iterate over all possible values for x.

Solution 3 :

Spacial : O(N) and O(1)
Temporal : O(N) and O(min(n,M)*log(n))

For each value x build a binary heap of all the position in the array where x is present. The key in your heap is the position but you also store the total number of x between this position and the begin of the array.

Given an interval we just need for each x to subtract 2 values to find the number of x in that interval : in O(log(N)) we can ask the x's heap to find the two positions just before the start/end of the interval and substract the numbers. Basically it needs less space than a histogram but the query in now in O(log(N)).

Upvotes: 2

Data structure that supports range based most frequently occuring element query

Answers (2)

Related Questions