Reputation: 21
I'm trying to compare two large vectors of integers, i.e. at each index, see whether the two vectors have the same element or not. I've tried a few things: using an iterator to do the comparison, and a simple for loop. Both work, but I need something that will speed things up, as I have to compare a lot of vectors. What's the best way to do that in C++? Many thanks in advance!
#include <cmath>    // log, exp
#include <vector>
using std::vector;

typedef vector<int> fingerprint;

double aakernel(const fingerprint& a, const fingerprint& b, double h){
    double diff = 0;
    /* option one: iterators */
    vector<int>::const_iterator dd = a.begin();
    vector<int>::const_iterator ee = b.begin();
    for (; dd != a.end() && ee != b.end(); ++dd, ++ee){
        if (*dd != *ee){
            diff++;
        }
    }
    /* option two: indexing -- use one option or the other, not both,
       or every difference gets counted twice */
    // for (int i = 0; i < int(a.size()); i++){
    //     if (a[i] != b[i]){
    //         diff++;
    //     }
    // }
    double due = h / (1 - h);
    double q = -log(due) * diff;
    double K = exp(q);
    return K;
}
Upvotes: 2
Views: 1391
Reputation: 106216
1) You could speed this up by dividing the work into pieces and using a different thread for each.
2) You could also explore parallel-processing machine opcodes, such as MMX, to see if they're applicable.
3) Depending on your compiler, its optimiser, CPU etc. you may or may not find significant performance benefits just from eliminating the branching: instead of...
if (*dd != *ee) {
    diff++;
}
...try perhaps...
diff += bool(*dd - *ee);
It might be worth checking the assembly language of the if ()
version first to see if the optimiser is already doing this. If bool(*dd - *ee)
still has branches you could try a few other things, falling back on inline assembly if necessary.
4) Assuming you'll end up comparing the same vector to many others, you could store checksums/hashes of ranges within the data, so that when the same vector is compared to different alternatives, only the regions with differing hashes are considered. This could miss some differences - about 1 in 2^bits for a good hash - but if this is for fingerprints I assume it's probabilistic anyway, so this will be insignificant.
5) if you're doing this for the NSA, I recommend recoding in VBA.
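To make suggestion 1 concrete, here is a minimal C++11 sketch that splits the comparison across two threads with `std::async` (the helper names are mine, and it assumes both vectors have the same size); the inner loop also uses the branchless form from suggestion 3:

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical helper: count mismatches over the index range [begin, end)
// of two equally sized vectors.
static std::size_t count_diffs(const std::vector<int>& a,
                               const std::vector<int>& b,
                               std::size_t begin, std::size_t end) {
    std::size_t diff = 0;
    for (std::size_t i = begin; i < end; ++i)
        diff += (a[i] != b[i]);   // branchless: bool converts to 0 or 1
    return diff;
}

// Split the work in two: one half on a worker thread, the other half here.
std::size_t parallel_count_diffs(const std::vector<int>& a,
                                 const std::vector<int>& b) {
    const std::size_t mid = a.size() / 2;
    auto first = std::async(std::launch::async, count_diffs,
                            std::cref(a), std::cref(b), std::size_t(0), mid);
    std::size_t second = count_diffs(a, b, mid, a.size());
    return first.get() + second;
}
```

Whether two threads actually beat one depends on vector size and hardware; for short vectors the thread launch overhead will dominate, so it's worth measuring.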
Upvotes: 2
Reputation: 21
Thanks a lot for all the different solutions! Much appreciated. I used diff as a double because at the end of the calculation it needs to go into a kernel function, and coming from a Python background I thought it would be better to make it a double in the first place. I might be wrong here, but thanks for the comment!
Also, to elaborate on the fingerprint (which I should have done in the first place, my apologies): maybe bitstring is a better word for it. Each entry contains 1 or 0 in my case, and I need to compare at each index whether the two bitstrings are the same or not. Many thanks again for the solutions; I'll try them and see which one speeds things up. Thanks a lot guys!
Upvotes: 0
Reputation: 94429
In case the two fingerprint
values are usually the same, it may help if you first do a
memcmp(&a[0], &b[0], a.size() * sizeof(int))
to test whether there's any difference between the two arrays at all. Only if there is a difference do you go and count how many differences there are.
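As a sketch of that fast path (the wrapper name is mine; it assumes both vectors have the same size, and relies on `vector<int>` storing its elements contiguously, which the standard guarantees):

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

std::size_t count_diffs_fast(const std::vector<int>& a,
                             const std::vector<int>& b) {
    // Fast path: one byte-wise comparison of the whole buffers.
    // (&a[0] is only valid for non-empty vectors, hence the guard.)
    if (a.empty() ||
        std::memcmp(&a[0], &b[0], a.size() * sizeof(int)) == 0)
        return 0;
    // Slow path: the arrays differ somewhere, so count where.
    std::size_t diff = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        diff += (a[i] != b[i]);
    return diff;
}
```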
Upvotes: 1
Reputation: 35
You don't need to write it yourself, since the STL has functions to do that, check this.
You can check more algorithms here:
http://www.cplusplus.com/reference/algorithm/
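For instance (a sketch; the function names are mine), the mismatch count can be written with `std::inner_product`, replacing "multiply" with `not_equal_to` and summing the resulting 0s and 1s, and `std::equal` answers the cheaper yes/no question:

```cpp
#include <algorithm>   // std::equal
#include <cstddef>
#include <functional>  // std::plus, std::not_equal_to
#include <numeric>     // std::inner_product
#include <vector>

// Count positions where a and b differ (assumes equal sizes).
std::size_t count_diffs_stl(const std::vector<int>& a,
                            const std::vector<int>& b) {
    return std::inner_product(a.begin(), a.end(), b.begin(),
                              std::size_t(0),
                              std::plus<std::size_t>(),    // sum up...
                              std::not_equal_to<int>());   // ...1 per mismatch
}

// Just "are they identical?" -- no counting needed.
bool same(const std::vector<int>& a, const std::vector<int>& b) {
    return a.size() == b.size() && std::equal(a.begin(), a.end(), b.begin());
}
```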
Upvotes: 0
Reputation: 171167
If the vectors are otherwise arbitrary, you cannot get asymptotically better than sequentially comparing all elements, the way you do now. So you're left with micro-optimisations which may or may not improve performance (depending on how your compiler's optimiser handles them).
The only one I can think of is taking the non-changing evaluations out of the loop. (And perhaps also not using ++ on type double, but I believe the compiler will handle this optimally anyway):
double diff = 0;
for (
    auto itA = a.begin(), itB = b.begin(), endA = a.end();
    itA != endA;
    ++itA, ++itB
) {
    if (*itA != *itB) {
        diff += 1.0;
    }
}
Upvotes: 3