Efficient pairwise comparisons - rows of Numpy 2D array

Question

I would like to compare each row of a NumPy 2D array with every other row and get a binary matrix that marks the non-matching features (columns) of each pair of rows.

For example, given the input:

 index col1 col2 col3 col4
   0    2    1    3    3
   1    2    3    3    4
   2    4    1    3    2

I would like to get the following output:

 index col1 col2 col3 col4  i  j
   0    0    1    0    1    0  1
   1    1    0    0    1    0  2
   2    1    1    0    1    1  2

where 'i' and 'j' hold the original indexes of the compared rows.

What is the most efficient way to implement this?

My current implementation takes too long due to a "for" loop:

import numpy as np
import pandas as pd

df = pd.DataFrame([[2,1,3,3],[2,3,3,4],[4,1,3,2]], columns=['A','B','C','D']) # example of a dataset
r = df.values
rows, cols = r.shape
additional_cols = ['i', 'j'] # original df indexes
allArrays = np.empty((0, cols + len(additional_cols)))

for i in range(0, rows):
    # compare row i with every later row; 1.0 marks a mismatch
    myArray = np.not_equal(r[i, :], r[i+1:, :]).astype(np.float32)
    myArray_with_idx = np.c_[myArray, np.repeat(i, rows-1-i), np.arange(i+1, rows)] # save original df indexes
    allArrays = np.concatenate((allArrays, myArray_with_idx), axis=0)
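
One possible way to drop the loop, offered as a sketch rather than a benchmarked answer: build all (i, j) index pairs with i < j up front using np.triu_indices, then compare the selected rows in a single vectorized step. The names diff and result are illustrative only, and the output here is an integer array rather than the float array the loop above produces.

```python
import numpy as np

# Same example data as in the question.
r = np.array([[2, 1, 3, 3],
              [2, 3, 3, 4],
              [4, 1, 3, 2]])

# All index pairs (i, j) with i < j, in the same order the loop visits them.
i, j = np.triu_indices(r.shape[0], k=1)

# Compare every pair of rows at once; 1 marks a non-matching column.
diff = (r[i] != r[j]).astype(np.int8)

# Append the original row indexes as the last two columns.
result = np.column_stack((diff, i, j))
print(result)
```

Note that the result has n*(n-1)/2 rows for n input rows, so memory use grows quadratically; for very large inputs the pairs would likely need to be processed in chunks.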
