Reputation: 1
I have one million binary sequences ,they are in the same length, such as (1000010011,1100110000....) and so on. And I want to know how many different groups they have(same sequences belong to same group ).what is the fastest way? No stoi please.
Upvotes: 0
Views: 45
Reputation: 2604
L < ~20: bucket sort
This is short enough in comparison to the input size. A bucketsort with L buckets is all you need. - preallocate an array of size 2L, since you have ~million sequences and 220 is ~million, you will only need O(n) of additional memory.
The time complexity will be O(n) with O(n) memory cost. This is optimal complexity-wise since you have to visit every element at least once to check its value anyway.
L reasonably large: hash table
If you pick a reasonable hashing function and a good size of the hash table(or a dictionary if we need to store the counts)1 you will have small number of collisions while inserting. The amortized time will be O(n) since if the hash is good, then the insert is amortized O(1).
As a side note, the bucket sort is technically a perfect hash since the hash function in this case is an one-to-one function.
L unreasonably large: binary tree
if for some reason the construction of hash is not feasible or you wish for consistency then building a binary tree to hold the values is a way to go.
This will take O(nlog(n)) as binary trees usually do.
1 ~2M should be enough and it is still O(n). Maybe you could go even lower to around 1,5M size.
Upvotes: 1