Reputation: 1031
I am tasked with calculating Hamming distances between 1D binary arrays in two groups: a group of 3000 arrays and a group of 10000 arrays, where every array is 100 items (bits) long. That's 3000 x 10000 HD calculations on 100-bit objects, and all of it must be done in at most a dozen minutes.
Here's the best of what I came up with:
# X - 3000 by 100 bool np.array
# Y - 10000 by 100 bool np.array
hd = []
i = 1
for x in X:
    print("object nr " + str(i) + "/" + str(len(X)))
    arr = np.array([x] * len(Y))  # stack x so it lines up with every row of Y
    C = Y ^ arr  # just xor this array by all the arrays in the other group simultaneously
    hd.append([sum(c) for c in C])  # add up all the set bits to get the Hamming distance
    i += 1
return np.array(hd)
And it's still going to take 1-1.5 hours for it to finish. How do I go about making this faster?
Upvotes: 7
Views: 1869
Reputation: 1612
Just in case you are not limited to using Python, here is a solution in C++ using std::bitset:
#include <iostream>
#include <bitset>
#include <vector>
#include <random>
#include <chrono>

using real = double;

std::mt19937_64 rng;
std::uniform_real_distribution<real> bitset_dist(0, 1);
real prob(0.75);

std::bitset<100> rand_bitset()
{
    std::bitset<100> bitset;
    for (size_t idx = 0; idx < bitset.size(); ++idx)
    {
        bitset[idx] = (bitset_dist(rng) < prob);
    }
    return bitset; // return by value so the compiler can apply NRVO
}

int main()
{
    rng.seed(std::chrono::high_resolution_clock::now().time_since_epoch().count());

    size_t v1_size(3000);
    size_t v2_size(10000);

    std::vector<size_t> hd;
    std::vector<std::bitset<100>> vec1;
    std::vector<std::bitset<100>> vec2;

    vec1.reserve(v1_size);
    vec2.reserve(v2_size);
    hd.reserve(v1_size * v2_size); /// Edited from hd.reserve(v1_size);

    for (size_t i = 0; i < v1_size; ++i)
    {
        vec1.emplace_back(rand_bitset());
    }
    for (size_t i = 0; i < v2_size; ++i)
    {
        vec2.emplace_back(rand_bitset());
    }

    std::cout << "vec1 size: " << vec1.size() << '\n'
              << "vec2 size: " << vec2.size() << '\n';

    auto start(std::chrono::high_resolution_clock::now());
    for (const auto& b1 : vec1)
    {
        for (const auto& b2 : vec2)
        {
            /// Count only the bits that are set and store them
            hd.emplace_back((b1 ^ b2).count());
        }
    }
    auto time(std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::high_resolution_clock::now() - start).count());

    std::cout << vec1.size() << " x " << vec2.size()
              << " xor operations on 100 bits took " << time << " ms\n";

    return 0;
}
On my machine, the whole operation (3000 x 10000) takes about 300 ms.
You could put this into a function, compile it into a shared library, and call it from Python (see the ctypes sketch below). Another option is to write the distances to a file and then read them back in Python.
EDIT: I had the wrong size for the hd vector. Reserving the proper amount of memory up front reduces the operation to about 190 ms because reallocations are avoided.
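A minimal sketch of the Python side of such a binding, assuming the XOR/count loop is wrapped in an extern "C" function (hamming_all and libhamming.so are hypothetical names for this illustration):
import ctypes
import numpy as np

# Hypothetical library, built e.g. with:
#   g++ -O3 -shared -fPIC hamming.cpp -o libhamming.so
lib = ctypes.CDLL("./libhamming.so")

# Assumed C signature of the wrapper:
#   void hamming_all(const uint8_t *x, const uint8_t *y,
#                    size_t nx, size_t ny, size_t nbits, uint16_t *out);
lib.hamming_all.restype = None
lib.hamming_all.argtypes = [
    np.ctypeslib.ndpointer(dtype=np.uint8, flags="C_CONTIGUOUS"),
    np.ctypeslib.ndpointer(dtype=np.uint8, flags="C_CONTIGUOUS"),
    ctypes.c_size_t, ctypes.c_size_t, ctypes.c_size_t,
    np.ctypeslib.ndpointer(dtype=np.uint16, flags="C_CONTIGUOUS"),
]

def hamming_distances(X, Y):
    X = np.ascontiguousarray(X, dtype=np.uint8)  # one byte per bit
    Y = np.ascontiguousarray(Y, dtype=np.uint8)
    out = np.empty((len(X), len(Y)), dtype=np.uint16)
    lib.hamming_all(X, Y, len(X), len(Y), X.shape[1], out)
    return out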
Upvotes: 1
Reputation: 59274
IIUC, you can use np.logical_xor and a list comprehension:
result = np.array([[np.logical_xor(X[a], Y[b]).sum() for b in range(len(Y))]
                   for a in range(len(X))])
The whole operation runs in about 7 seconds on my machine:
0:00:07.226645
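If you want to drop the Python-level loop entirely, the same distances can be computed with two matrix products, since a bit contributes to the distance exactly when one array has a 1 where the other has a 0. A minimal sketch, assuming X and Y are the bool arrays from the question:
import numpy as np

Xi = X.astype(np.int32)  # 0/1 integers so matrix products count bits
Yi = Y.astype(np.int32)

# Mismatches where x=1,y=0 plus mismatches where x=0,y=1;
# the result has shape (3000, 10000)
hd = Xi @ (1 - Yi).T + (1 - Xi) @ Yi.T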
Upvotes: 1
Reputation: 155566
You should be able to dramatically improve the summing speed by using numpy to perform it, rather than using a list comprehension with the built-in sum function (which takes no advantage of numpy's vectorized operations).
Just replace:
hd.append([sum(c) for c in C])
with:
# Explicitly use uint16 to reduce memory cost; if array sizes might increase
# you can use uint32 to leave some wiggle room
hd.append(C.sum(1, dtype=np.uint16))
which, for a 2D array, will return a new 1D array where each value is the sum of the corresponding row (thanks to specifying that it should operate on axis 1). For example:
>>> arr = np.array([[True, False, True], [False, False, True], [True, True, True]], dtype=bool)
>>> arr.sum(1, np.uint16)
array([2, 1, 3], dtype=uint16)
Since it performs all the work at the C layer in a single operation without type conversions, this should run substantially faster at the array scales you're describing. Your original approach, by contrast, requires a Python-level loop over each row, plus an inner loop that, while it runs at the C layer, must still convert each numpy bool one by one to a Python-level int just to sum them.
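If you want to measure the gap on your own machine, a quick sketch using timeit (the random bool block here stands in for one of your XOR results C):
import timeit
import numpy as np

C = np.random.rand(10000, 100) < 0.5  # stand-in for one XOR result block

t_py = timeit.timeit(lambda: [sum(c) for c in C], number=1)
t_np = timeit.timeit(lambda: C.sum(1, dtype=np.uint16), number=1)
print(f"built-in sum: {t_py:.3f} s, numpy sum: {t_np:.3f} s")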
Side-note: While not the source of your performance problems, there is no reason to manually maintain your index value; enumerate can do that more quickly and easily. Simply replace:
i = 1
for x in X:
    ... rest of loop ...
    i += 1
with:
for i, x in enumerate(X, 1):
    ... rest of loop ...
and you'll get the same behavior, but slightly faster, more concise and cleaner in general.
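Putting both changes together, the loop from the question might look like this (the wrapper function name is mine; note that numpy broadcasting also makes the explicit row-stacking of x unnecessary):
import numpy as np

def all_hamming_distances(X, Y):  # hypothetical wrapper around the OP's loop
    hd = []
    for i, x in enumerate(X, 1):
        print("object nr " + str(i) + "/" + str(len(X)))
        C = Y ^ x  # broadcasting XORs x against every row of Y at once
        hd.append(C.sum(1, dtype=np.uint16))
    return np.array(hd)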
Upvotes: 5