user3792088
user3792088

Reputation: 29

How to fast identify contiguous range of 1’s(Index) in huge binary data?

Could anyone suggest faster algorithm to identify continuous range of 1's in large binary data ?

Is traversing the data is the only solution? Traversing will give O(n) in worst case which I really don't want.

Does anyone can suggest faster algorithm?

As shown in below fig. I need to find the index 4000 which is start position of continuous range of 1's

index 0
|
00000000000000000000000000000000000000000011111100000

Upvotes: 1

Views: 83

Answers (2)

user541686
user541686

Reputation: 210593

Well, you can't avoid going through the entire data at least once (you have to look at everything at least!), but you can avoid going through it multiple times if you e.g. run-length encode the data.

Upvotes: 0

CapEnt
CapEnt

Reputation: 287

I could not think of anything that would not be O(n), since the data is always unsorted.

But, i can think of shortcuts, since you want a set of at least 3, and is binary data.

#include <iostream>

using namespace std;

int main()
{
    unsigned int seed = 3758096384; //11100000000000000000000000000000
    unsigned int testvar = 419307644; //00011000111111100010000001111100
    int result = 0;
    int continuous = 0;

    while (seed != 7 && (continuous == 1 || result == 0)) {
        if (seed == (testvar & seed)) {
            result |= seed;
            continuous = 1;
        } else
            continuous = 0;
        seed >>= 1;
    }
    // result = 16646144 or 00000000111111100000000000000000
    cout << result << endl;
    //the index, 8388608 or 00000000100000000000000000000000
    cout << (int)((result ^ (result >> 1)) & ~(result >> 1)) << endl;
    return 0;
}

How it works: It is a binary filter, it creates a mask of 3 bits, and continuous shift to left by 1 in every step of the loop.

So you have these numbers as filters:

3758096384 - 11100000000000000000000000000000
1879048192 - 01110000000000000000000000000000
939524096  - 00111000000000000000000000000000
...
14         - 00000000000000000000000000001110
7          - 00000000000000000000000000000111

Then it checks if the seed match with the result of a logical AND between the number tested and the seed itself (this filters all the numbers that don't match the filter).

If the seed and the AND match, it moves the seed to the result using a logical OR, and set a continuous to control the continuity of the sequence. The first time the result is not continuous, it breaks the loop.

In the end, you have the result and can calculate the index by:

1110
0111 SHIFT TO LEFT by 1 and XOR
1001
0111 NOT (SHIFT TO LEFT by 1) and AND
------------
1000

You will need to scan your 50gb data in 32bits chunks (easy to adapt to 64bits, or even do vectorization of it).

Upvotes: 1

Related Questions