tim_76

Reputation: 96

Computing hamming distances on large data set

I have an input file of about 10^5 rows.
Each row is a sequence of 24 bits, e.g.:

1 0 1 1 1 0 1 0 1 0 1 1 1 0 1 0 1 0 1 1 1 0 1 0

I need to compute the Hamming distance for every pair of rows. Here's my first implementation, using SciPy's hamming function:

import csv
from scipy.spatial.distance import hamming

with open('input.txt', 'r') as file:
    reader = csv.reader(file, delimiter=' ')
    nodes = {}
    b = 24  # Number of bits
    for nodeNum, node in enumerate(reader):
        nodes[nodeNum] = [int(i) for i in node]

for u, uBits in nodes.items():
    for v, vBits in nodes.items():
        # scipy's hamming is normalized, so scale by the bit count
        distance = hamming(uBits, vBits) * b
        # Do stuff

Here's the second implementation I came up with:

nodes[nodeNum] = sum([int(bit)*2**power for power, bit in enumerate(node)])

Here I only store the decimal value but I then have to manually count set bits resulting from each XOR operation:

def hamming(a, b):
    N = a ^ b  # differing positions are the set bits of the XOR

    distance = 0
    while N:
        distance += N & 1  # count the lowest bit
        N >>= 1            # and shift it out

    return distance
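
As an aside, on Python 3.10+ the manual loop can be replaced entirely by the built-in popcount, int.bit_count():

def hamming(a, b):
    # int.bit_count() counts set bits natively (Python >= 3.10)
    return (a ^ b).bit_count()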

How can I improve my code (both in terms of memory usage and runtime, ideally)?

Upvotes: 0

Views: 747

Answers (2)

AboAmmar

Reputation: 5559

This may be the fastest you can do (but watch out: for your data size, the full distance matrix requires allocating 74.5 GiB of memory):

import csv
import numpy as np

nodes = []
with open('input.txt', 'r') as file:
    reader = csv.reader(file, delimiter=' ')
    for node in reader:
        nodes.append([int(i) for i in node])
nodes = np.array(nodes)  # the arithmetic below needs an ndarray

dists = 2 * np.inner(nodes - 0.5, 0.5 - nodes) + nodes.shape[1] / 2
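
For reference, the one-liner relies on the identity 2*(u - 0.5)·(0.5 - v) + d/2 = number of positions where the 0/1 vectors u and v of length d differ. A small sketch checking it against a brute-force count (the toy array a is made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=(5, 24))  # five made-up 24-bit rows

dists = 2 * np.inner(a - 0.5, 0.5 - a) + a.shape[1] / 2
brute = (a[:, None, :] != a[None, :, :]).sum(axis=2)  # direct pairwise count
assert np.array_equal(dists.astype(int), brute)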

Just for fun, here is a 40X faster version in Julia:

using LoopVectorization, Tullio

function hamming!(nodes,dists)
    @tullio dists[i,j] = sum(nodes[i,k] ⊻ nodes[j,k])
end

n = 10^5
nodes = rand(Int8[0,1],n,24)
dists = Matrix{Int8}(undef,n,n)
@time hamming!(nodes,dists) # Run twice
  # 1.886367 seconds (114 allocations: 6.594 KiB)

While we're at it, I invite you to enter the world of Julia: it offers C++-like speed with a pleasant, Python-like syntax.
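
The same XOR-and-popcount idea also works in NumPy if each 24-bit row is first packed into a single integer, as in the asker's second attempt. A minimal chunked sketch (the function name and chunk size are arbitrary; only one block of the distance matrix is held in memory at a time):

import numpy as np

def pairwise_hamming(codes, chunk=1024):
    """Yield (start, block) slices of the n-by-n Hamming distance matrix.

    codes is assumed to be a (n,) uint32 array of packed 24-bit rows, e.g.
    codes = (bits * (1 << np.arange(24))).sum(axis=1).astype(np.uint32).
    """
    n = codes.shape[0]
    for start in range(0, n, chunk):
        xor = codes[start:start + chunk, None] ^ codes[None, :]  # (chunk, n)
        # Popcount: view each uint32 as 4 bytes, unpack to bits, and sum.
        as_bytes = xor.view(np.uint8).reshape(xor.shape + (4,))
        yield start, np.unpackbits(as_bytes, axis=-1).sum(axis=-1)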

Upvotes: 1

NickleDave

Reputation: 372

Why not just put the whole .csv into an array and then let scipy do all the work of computing pairwise distances?

import csv

import numpy as np
import scipy.spatial.distance

nodes = []
with open('input.txt', 'r') as file:
    reader = csv.reader(file, delimiter=' ')
    for node in reader:
        nodes.append([int(i) for i in node])

nodes = np.array(nodes)  # not strictly necessary; pdist also accepts lists
dists = scipy.spatial.distance.pdist(nodes, 'hamming')
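
Note that pdist with the 'hamming' metric returns a condensed 1-D array of normalized distances, i.e. the fraction of positions that differ. To get raw bit counts or a full square matrix, scale and expand it:

from scipy.spatial.distance import squareform

bit_dists = dists * 24          # fraction of differing bits -> count
square = squareform(bit_dists)  # full n-by-n matrix (huge for n = 10^5!)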

Upvotes: 1
