Reputation: 29
Could anyone suggest faster algorithm to identify continuous range of 1's in large binary data ?
Is traversing the data is the only solution? Traversing will give O(n)
in worst case which I really don't want.
Does anyone can suggest faster algorithm?
As shown in below fig. I need to find the index 4000 which is start position of continuous range of 1's
index 0
|
00000000000000000000000000000000000000000011111100000
Upvotes: 1
Views: 83
Reputation: 210593
Well, you can't avoid going through the entire data at least once (you have to look at everything at least!), but you can avoid going through it multiple times if you e.g. run-length encode the data.
Upvotes: 0
Reputation: 287
I could not think of anything that would not be O(n), since the data is always unsorted.
But, i can think of shortcuts, since you want a set of at least 3, and is binary data.
#include <iostream>
using namespace std;
int main()
{
unsigned int seed = 3758096384; //11100000000000000000000000000000
unsigned int testvar = 419307644; //00011000111111100010000001111100
int result = 0;
int continuous = 0;
while (seed != 7 && (continuous == 1 || result == 0)) {
if (seed == (testvar & seed)) {
result |= seed;
continuous = 1;
} else
continuous = 0;
seed >>= 1;
}
// result = 16646144 or 00000000111111100000000000000000
cout << result << endl;
//the index, 8388608 or 00000000100000000000000000000000
cout << (int)((result ^ (result >> 1)) & ~(result >> 1)) << endl;
return 0;
}
How it works: It is a binary filter, it creates a mask of 3 bits, and continuous shift to left by 1 in every step of the loop.
So you have these numbers as filters:
3758096384 - 11100000000000000000000000000000
1879048192 - 01110000000000000000000000000000
939524096 - 00111000000000000000000000000000
...
14 - 00000000000000000000000000001110
7 - 00000000000000000000000000000111
Then it checks if the seed match with the result of a logical AND between the number tested and the seed itself (this filters all the numbers that don't match the filter).
If the seed and the AND match, it moves the seed to the result using a logical OR, and set a continuous to control the continuity of the sequence. The first time the result is not continuous, it breaks the loop.
In the end, you have the result and can calculate the index by:
1110
0111 SHIFT TO LEFT by 1 and XOR
1001
0111 NOT (SHIFT TO LEFT by 1) and AND
------------
1000
You will need to scan your 50gb data in 32bits chunks (easy to adapt to 64bits, or even do vectorization of it).
Upvotes: 1