Maik Martinez
Maik Martinez

Reputation: 13

hamming distance python improvement

I am trying to write hamming distance.
As input I do have two matrices M1 and M2, one is 40x20, second 50x20; they contain True/False. I need to calc distance between each row, so M1[0] distance with M2[0], M1[0] with M2[1] ... M1[39] with M2[49]. Resulting in 40x50 result matrix. My first try was of course looping like:

for x_i in range(X.shape[0]):
        for x_train_j in range(X_train.shape[0]):
            distance_results[x_i, x_train_j] = sum(np.logical_xor(x_array[x_i, :], x_train_array[x_train_j, :]))

it is correct, however, it is too slow for my data. I have some thoughts about multiplying M1.dot(M2.T), which gives right shape of matrix and does summing in one step but when multiplying and summing xor is desired so:
1 * 1 = 1 (need 0 - bad)
1 * 0 = 0 (need 1 - bad)
0 * 1 = 0 (need 1 - bad)
0 * 0 = 0 (need 0 - is OK)
Do you have any idea how can I get desired result? I think i miss some math to do it right and fast. Thanks in advance.

Upvotes: 1

Views: 490

Answers (1)

gbtimmon
gbtimmon

Reputation: 4332

M1.dot(M2.T) could be in a way considered a AND for Boolean logic. For binary logic A xor B is equivalent to (A and not B) or (not A and B) so therefore one thing you could do --

M1.dot(M2.T):
1 * 1 = 1 
1 * 0 = 0 
0 * 1 = 0
0 * 0 = 0

(not M1).dot(M2.T):
0 * 1 = 0
0 * 0 = 0 
1 * 1 = 1
1 * 0 = 0

M1.dot(not M2.T):
1 * 0 = 0 
1 * 1 = 1
0 * 0 = 0
0 * 1 = 0

(not M1).dot(M2.T) or M1.dot(not M2.T):
0 * 1 || 1 * 0 = 0
0 * 0 || 1 * 1 = 1 
1 * 1 || 0 * 0 = 1
1 * 0 || 0 * 1 = 0

To code it up you need to be careful to cast out of logical before the dot so that the sum is calculated. If you dot on 2 logical arrays you wont get the result you want. I just use -1 as my not operator ( True - 1 = 0, False - 1 = -1). Also the 'or' here is really a sum since you are considering multiple bools. At the end you end up with your distances in the negative, easy to fix with absolute.

_A = [[ True,  True,  False],
      [ True,  False, True]]

_B = [[ True,  False, False],
      [ False, False, False],
      [ True,  True,  True ]]

Expected output: [[ 1, 2, 1], [ 1, 2, 1]]

A = numpy( _A, dtype=bool)
B = numpy( _B, dtype=bool)

numpy.absolute(numpy.dot(A,B.T-1) + numpy.dot(A-1, B.T))

>>>array([[1, 2, 1],
          [1, 2, 1]])

Upvotes: 1

Related Questions