Reputation:

Calculating number distinct elements in a huge/big array

I have a large array say .. array[7000].At each index there is a number between 0 and 9.Then given two indices say x=100 , y= 105 I need to find the number of distinct elements between the given range. Assuming

array[100]=1;
array[101]=2;
array[102]=1;
array[103]=4;
array[104]=7;
array[105]=2;

now here the number of distinct elements is 4 {i.e numbers 1 ,2,1,7 }. This can be calculated by iterating between the two indices, and keeping track of the numbers.But the main problem I am facing is that the number of queries that the program has to perform is very large, i.e the operation has to be to done for several thousand values of x and y (say 10^6 queries), in that case t this method would become quiet inefficient.What can be a better algorithm, i tried searching this at many places.

Upvotes: 2

Answers (4)

Abhishek Bansal

Reputation: 12705

Ok this solution is not memory efficient but it will serve the purpose of calculating the queries in O(1).

Form another 2D array of 7000x10 in which each row is the count of [0 to 9s] upto that entry.

So whenever you get a query, you just have to subtract the respective indices of the rows to get the count of [0..9] numbers.

Check which indices have a non-zero difference of count.

For example, the array rows may look like:

A[36] = { 5, 4, 2,  1, 4, 3, 7, 8, 2, 1 }
A[38] = { 6, 4, 2,  1, 4, 3, 7, 8, 2, 2 }

Diff =  { 1, 0, 0,  0, 0, 0, 0, 0, 0, 1 }

So there are two unique elements (0 and 9) between 36 and 38.

Example 2:
For input array {1, 2, 3, 2, 4}, the matrix A shall be:

A[0] = { 0, 1, 0,  0, 0, 0, 0, 0, 0, 0 }
A[1] = { 0, 1, 1,  0, 0, 0, 0, 0, 0, 0 }
A[2] = { 0, 1, 1,  1, 0, 0, 0, 0, 0, 0 }
A[3] = { 0, 1, 2,  1, 0, 0, 0, 0, 0, 0 }
A[4] = { 0, 1, 2,  1, 1, 0, 0, 0, 0, 0 }

So for A[3] to A[4], there shall be 1 unique element (4)

Diff =  { 0, 0, 0,  0, 1, 0, 0, 0, 0, 0 }

So for A[1] to A[4], there shall be 3 unique element (2, 3, 4)

Diff =  { 0, 0, 1,  1, 1, 0, 0, 0, 0, 0 }

EDIT: I have considered the convention that A[1] to A[4] means unique elements in indices [2 to 4]. This can be changed to 1 to 4 if required. In that case,

Diff =  { 0, 0, 2,  1, 1, 0, 0, 0, 0, 0 }

(Again 3 unique elements)

Upvotes: 4

Tim

Reputation: 2151

Since your bottleneck is the query times, and the array element is from 0-9, here is my solution, in just O(n).

Assume:
A = arr
P = {} (Hash or Dict)
P[A,0,-1] = [0,0,0,0,0,0,0,0,0,0]   

For i = 0,len(A):
    e = A[i]
    P[A,0,i] = P[A,0,i-1] where P[A,0,i-1][e]++ 

Query(A,x,y)
    res = 0
    ax = P[A,0,x]
    ay = P[A,0,y]
    for i = 0,10
        res += ay[i] - ax[i] > 1 ? 1 : ay[i] - ax[i]    

    return res + 1

A = [1,2,1,4,7,2,2]

P[A,0,0] = [0,1,0,0,0,0,0,0,0,0]

P[A,0,1] = [0,1,1,0,0,0,0,0,0,0]

P[A,0,2] = [0,2,1,0,0,0,0,0,0,0]

P[A,0,3] = [0,2,1,0,1,0,0,0,0,0]

P[A,0,4] = [0,2,1,0,1,0,0,1,0,0]

P[A,0,5] = [0,2,2,0,1,0,0,1,0,0]

P[A,0,6] = [0,2,3,0,0,0,0,0,0,0]

Query(A,1,4) = [0,1,0,1,0,0,0,0,1,0,0] = 3+1 = 4

Query(A,4,6) = [0,0,2,0,0,0,0,0,0,0,0] = [0,0,1,0,0,0,0,0,0,0,0] = 1+1 = 2

Upvotes: 1

user555045

Reputation: 64903

Since the numbers are in the range 0-9, we can use some nice cheats.

The main idea: use the bit i in an integer to indicate the presence of the number i. So { } becomes 0, { 0 } be comes 1, { 1 } becomes 2, { 0, 1 } becomes 3, and so on. It's easy to add an item x to it, simply take set | (1 << x), and similarly, the union of two such sets is trivial: set1 | set2. There's also a nice way to calculate the number of 1 bits in an integer, which gives the cardinality of the set.

That doesn't help too much yet, it just improves the real-life performance of a query. But this isn't all.

Now build a binary tree, where at the leafs you take (1 << array[i]) | (1 << array[i + 1]) as the node. The level above, you take the union of the two children (if one child is not present, just take the value of the other child). This tree is a complete binary tree, so you should store it like a binary heap (it isn't a heap, of course) to avoid having to waste a ton of space on child pointers. This way you only need to store the integers that represent the sets, in an array that is no longer than the input. Building this tree obviously takes O(n) time.

For a query for the range x .. y, take the union of everything in that range* (be careful of edge cases), then calculate the population count of the final set (can be optimized a bit because only 10 bits can ever be set). This is an O(log(y - x)) operation.

*: to expand on that a bit, the idea is to recurse down the tree, and any time the range completely contains the current node, you return that node. If a leaf is not completely contained in the range, take
1 << array[something] (from the input array).

Upvotes: 0

Erti-Chris Eelmaa

Reputation: 26298

You are looking for a segmented trees.

http://en.wikipedia.org/wiki/Segment_tree

Start with the problem that each index has value between 0-1, and work your way up. You can answer then queries with O(logn) time and build the tree with O(nlogn) time.

Upvotes: 1

Calculating number distinct elements in a huge/big array

Answers (4)

Related Questions