sam_33

Reputation: 595

Counting unique elements in a large array

One of my colleagues was asked this question in an interview.

Given a huge array that stores unsigned ints. The length of the array is 100,000,000. Find an efficient way to count the unique elements present in the array.

E.g. arr = {2,34,5,6,7,2,2,5,1,34,5}

O/p: Count of 2 is 3, count of 34 is 2, and so on.

What are efficient algorithms to do this? My first thought was a dictionary/hash table, but since the array is very large I assumed that would be inefficient. Is there a better way to do this?

Upvotes: 5

Views: 10186

Answers (7)

mat

Reputation: 11

How about using a Bloom filter implementation, like http://code.google.com/p/java-bloomfilter/? For each element, first do bloom.contains(element); if true, continue; if false, do bloom.add(element).

At the end, count the number of elements added. A Bloom filter needs roughly 125 MB of memory to store 100,000,000 elements at 10 bits per element.

The problem is that false positives are possible in Bloom filters, and they can only be minimized by increasing the number of bits per element. This could be addressed by using two Bloom filters with different hash functions that both need to agree.
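As a rough sketch of the single-filter core of this idea (hand-rolled with java.util.BitSet rather than the linked library's API; the class name, method names, and hash mixing below are all illustrative):

```java
import java.util.BitSet;

// Minimal Bloom-filter sketch for approximate unique counting.
// m bits of storage, k probe positions per value; sizes are not tuned.
class BloomUniqueCounter {
    private final BitSet bits;
    private final int m;          // number of bits in the filter
    private final int k;          // number of hash probes per value
    private long uniqueEstimate;  // how many values were added as "new"

    BloomUniqueCounter(int m, int k) {
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }

    // Derive the i-th probe index from two base hashes
    // (Kirsch–Mitzenmacher double hashing).
    private int index(int value, int i) {
        int h1 = Integer.hashCode(value);
        int h2 = Integer.reverse(value) * 0x9E3779B1;
        return Math.floorMod(h1 + i * h2, m);
    }

    // Returns true if the value was (probably) not seen before,
    // setting its bits and bumping the estimate in that case.
    boolean addIfAbsent(int value) {
        boolean present = true;
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(value, i))) {
                present = false;
            }
        }
        if (!present) {
            for (int i = 0; i < k; i++) {
                bits.set(index(value, i));
            }
            uniqueEstimate++;
        }
        return !present;
    }

    long uniqueEstimate() {
        return uniqueEstimate;
    }
}
```

Because of false positives, uniqueEstimate() can undercount; it never overcounts, since a value is only counted when at least one of its bits was unset.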

Upvotes: 1

templatetypedef

Reputation: 372784

Many other posters have suggested sorting the data and then finding the number of adjacent values, but no one has mentioned using radix sort yet to get the runtime to be O(n lg U) (where U is the maximum value in the array) instead of O(n lg n). Since lg U = O(lg n), assuming that integers take up one machine word, this approach is asymptotically faster than heapsort.

Non-comparison sorts are always fun in interviews. :-)
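A minimal sketch of that approach (LSD radix sort over 8-bit digits, so four counting passes for 32-bit values, followed by one linear scan over adjacent elements; the class name is illustrative):

```java
// LSD radix sort on unsigned 32-bit values, then count distinct
// adjacent values in the sorted array. Runs in O(n * (32/8)) = O(n)
// passes for fixed-width ints, i.e. O(n lg U) in general.
class RadixUnique {
    static void radixSort(int[] a) {
        int[] buf = new int[a.length];
        for (int shift = 0; shift < 32; shift += 8) {
            int[] count = new int[257];
            // Histogram of the current 8-bit digit (unsigned via >>>).
            for (int x : a) {
                count[((x >>> shift) & 0xFF) + 1]++;
            }
            // Prefix sums: count[d] = first output slot for digit d.
            for (int i = 0; i < 256; i++) {
                count[i + 1] += count[i];
            }
            // Stable scatter into the buffer, then copy back.
            for (int x : a) {
                buf[count[(x >>> shift) & 0xFF]++] = x;
            }
            System.arraycopy(buf, 0, a, 0, a.length);
        }
    }

    static int countUnique(int[] a) {
        if (a.length == 0) return 0;
        radixSort(a);
        int unique = 1;
        for (int i = 1; i < a.length; i++) {
            if (a[i] != a[i - 1]) unique++;
        }
        return unique;
    }
}
```

Note the trade-off against heapsort: this needs an O(n) auxiliary buffer, so it is not in-place.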

Upvotes: 7

ChrisJ

Reputation: 5251

If the range of the int values is limited, then you may allocate an array, which serves to count the occurrences for each possible value. Then you just iterate through your huge array and increment the counters.

for (int x : hugeArray) {
    counter[x]++;
}

Thus you find the solution in linear time (O(n)), but at the expense of memory consumption. That is, if your ints span the whole range allowed by 32-bit ints, you would need to allocate an array of 4G ints, which is impractical...
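Put together, the counting-array approach looks something like this (class and parameter names are illustrative; it assumes all values fall in [0, maxValue)):

```java
// Counting approach for a limited value range [0, maxValue):
// one counter per possible value, then a scan over the counters.
// Memory is O(maxValue), so this is only viable for small ranges.
class RangeCounter {
    static int countUnique(int[] hugeArray, int maxValue) {
        int[] counter = new int[maxValue];
        for (int x : hugeArray) {
            counter[x]++;
        }
        int unique = 0;
        for (int c : counter) {
            if (c > 0) unique++;
        }
        return unique;
    }
}
```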

Upvotes: 1

mmatloka

Reputation: 2014

Sorting is a good idea. However, the type of sort depends on the range of possible values: for a small range, counting sort would be good. When dealing with such a big array, it would also be efficient to use multiple cores, and radix sort parallelizes well.

Upvotes: 0

Murilo Vasconcelos

Reputation: 4827

Hashing in this case is not inefficient. The cost will be approximately O(N): O(N) for iterating over the array and ~O(N) for iterating over the hash table. Since you need O(N) just to examine each element, the complexity is as good as it gets.
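Concretely, a single pass with a HashMap gives both the unique count and the per-value counts the question asks for (the class name is illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// One O(N) pass: Map.merge adds 1 to the existing count for x,
// or inserts 1 if x has not been seen yet. The map's size is the
// number of unique elements.
class HashCounter {
    static Map<Integer, Integer> counts(int[] arr) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int x : arr) {
            counts.merge(x, 1, Integer::sum);
        }
        return counts;
    }
}
```

The practical concern with 100,000,000 elements is constant-factor memory overhead per entry, not asymptotic time.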

Upvotes: 0

Girish Rao

Reputation: 2669

Heap sort is O(nlogn) and in-place. In-place is necessary when dealing with large data sets. Once sorted you can make one pass through the array tallying occurrences of each value. Because the array is sorted, once a value changes you know you've seen all occurrences of the previous value.
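A sketch of the sort-then-tally pass (using Arrays.sort, which is in-place for primitive arrays, in lieu of a hand-written heap sort; the class name is illustrative):

```java
import java.util.Arrays;

// Sort in place, then make one pass tallying runs of equal values.
// Each time a run ends, arr[i] has been fully counted.
class SortedTally {
    static int countUnique(int[] arr) {
        if (arr.length == 0) return 0;
        Arrays.sort(arr);  // in-place for primitives
        int unique = 0;
        int runLength = 0;
        for (int i = 0; i < arr.length; i++) {
            runLength++;
            if (i == arr.length - 1 || arr[i] != arr[i + 1]) {
                // End of a run: arr[i] occurred runLength times.
                unique++;
                runLength = 0;
            }
        }
        return unique;
    }
}
```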

Upvotes: 10

payne

Reputation: 14177

Sort it, then scan it from the beginning to determine the counts for each item.

This approach requires no additional storage, and can be done in O(n log n) time (for the sort).

Upvotes: 2
