tim_76

Reputation: 96

Computing hamming distances on large data set

I have an input file of about 10^5 rows.
Each row is a sequence of 24 bits, e.g.:

1 0 1 1 1 0 1 0 1 0 1 1 1 0 1 0 1 0 1 1 1 0 1 0

I need to compute the Hamming distance for every pair of rows. Here's my first implementation, using SciPy's hamming function:

import csv
from scipy.spatial.distance import hamming

with open('input.txt', 'r') as file:
    reader = csv.reader(file, delimiter=' ')
    nodes = {}
    b = 24  # Number of bits
    for nodeNum, node in enumerate(reader):
        nodes[nodeNum] = [int(i) for i in node]

for u, uBits in nodes.items():
    for v, vBits in nodes.items():
        # scipy's hamming is normalized, so scale by the bit count
        distance = hamming(uBits, vBits) * b
        # Do stuff

Here's the second implementation I came up with:

nodes[nodeNum] = sum([int(bit)*2**power for power, bit in enumerate(node)])

Here I only store the decimal value but I then have to manually count set bits resulting from each XOR operation:

def hamming(a, b):
    N = a ^ b  # differing positions are the set bits of the XOR

    distance = 0
    while N:
        distance += N & 1  # count the lowest bit
        N >>= 1            # and shift it out

    return distance
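
As an aside, on Python 3.10+ the manual loop can be replaced entirely by the built-in popcount, int.bit_count():

def hamming(a, b):
    # int.bit_count() counts set bits natively (Python >= 3.10)
    return (a ^ b).bit_count()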

How can I improve my code (both in terms of memory usage and runtime, ideally)?

Upvotes: 0

Views: 747

Answers (2)

AboAmmar

Reputation: 5559

This may be the fastest you can do (but watch out: for your data size, the full distance matrix requires allocating 74.5 GiB of memory):

import csv
import numpy as np

nodes = []
with open('input.txt', 'r') as file:
    reader = csv.reader(file, delimiter=' ')
    for node in reader:
        nodes.append([int(i) for i in node])
nodes = np.array(nodes)  # the arithmetic below needs an ndarray

dists = 2 * np.inner(nodes - 0.5, 0.5 - nodes) + nodes.shape[1] / 2
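
For reference, the one-liner relies on the identity 2*(u - 0.5)·(0.5 - v) + d/2 = number of positions where the 0/1 vectors u and v of length d differ. A small sketch checking it against a brute-force count (the toy array a is made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=(5, 24))  # five made-up 24-bit rows

dists = 2 * np.inner(a - 0.5, 0.5 - a) + a.shape[1] / 2
brute = (a[:, None, :] != a[None, :, :]).sum(axis=2)  # direct pairwise count
assert np.array_equal(dists.astype(int), brute)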

Just for fun, here is a 40X faster version in Julia:

using LoopVectorization, Tullio

function hamming!(nodes,dists)
    @tullio dists[i,j] = sum(nodes[i,k] ⊻ nodes[j,k])
end

n = 10^5
nodes = rand(Int8[0,1],n,24)
dists = Matrix{Int8}(undef,n,n)
@time hamming!(nodes,dists) # Run twice
  # 1.886367 seconds (114 allocations: 6.594 KiB)

While we're at it, I invite you to enter the world of Julia: it offers C++-like speed with a pleasant, Python-like syntax.
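
The same XOR-and-popcount idea also works in NumPy if each 24-bit row is first packed into a single integer, as in the asker's second attempt. A minimal chunked sketch (the function name and chunk size are arbitrary; only one block of the distance matrix is held in memory at a time):

import numpy as np

def pairwise_hamming(codes, chunk=1024):
    """Yield (start, block) slices of the n-by-n Hamming distance matrix.

    codes is assumed to be a (n,) uint32 array of packed 24-bit rows, e.g.
    codes = (bits * (1 << np.arange(24))).sum(axis=1).astype(np.uint32).
    """
    n = codes.shape[0]
    for start in range(0, n, chunk):
        xor = codes[start:start + chunk, None] ^ codes[None, :]  # (chunk, n)
        # Popcount: view each uint32 as 4 bytes, unpack to bits, and sum.
        as_bytes = xor.view(np.uint8).reshape(xor.shape + (4,))
        yield start, np.unpackbits(as_bytes, axis=-1).sum(axis=-1)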

Upvotes: 1

NickleDave

Reputation: 372

Why not just put the whole .csv into an array and then let scipy do all the work of computing pairwise distances?

import csv

import numpy as np
import scipy.spatial.distance

nodes = []
with open('input.txt', 'r') as file:
    reader = csv.reader(file, delimiter=' ')
    for node in reader:
        nodes.append([int(i) for i in node])

nodes = np.array(nodes)  # not strictly necessary; pdist also accepts lists
dists = scipy.spatial.distance.pdist(nodes, 'hamming')
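
Note that pdist with the 'hamming' metric returns a condensed 1-D array of normalized distances, i.e. the fraction of positions that differ. To get raw bit counts or a full square matrix, scale and expand it:

from scipy.spatial.distance import squareform

bit_dists = dists * 24          # fraction of differing bits -> count
square = squareform(bit_dists)  # full n-by-n matrix (huge for n = 10^5!)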

Upvotes: 1
