Reputation: 31
I would like to compare each row of a NumPy 2D array with every other row and get a binary matrix that marks the non-matching features of each pair of rows.
For example, for the input:
index  col1  col2  col3  col4
0      2     1     3     3
1      2     3     3     4
2      4     1     3     2
I would like to get the following output:
index  col1  col2  col3  col4  i  j
0      0     1     0     1     0  1
1      1     0     0     1     0  2
2      1     1     0     1     1  2
where 'i' and 'j' hold the original indexes of the compared rows.
What is the most efficient way to implement this?
My current implementation takes too long due to a "for" loop:
import numpy as np
import pandas as pd

df = pd.DataFrame([[2,1,3,3],[2,3,3,4],[4,1,3,2]], columns=['A','B','C','D'])  # example of a dataset
r = df.values
rows, cols = r.shape
additional_cols = ['i', 'j']  # original df indexes
allArrays = np.empty((0, cols + len(additional_cols)))
for i in range(0, rows):
    # compare row i with every later row
    myArray = np.not_equal(r[i, :], r[i+1:, :]).astype(np.float32)
    myArray_with_idx = np.c_[myArray, np.repeat(i, rows-1-i), np.arange(i+1, rows)]  # save original df indexes
    allArrays = np.concatenate((allArrays, myArray_with_idx), axis=0)
Upvotes: 2
Views: 1706
Reputation: 221524
Approach #1: Here's one with np.triu_indices -
a = df.values
R,C = np.triu_indices(len(a),1)
out = np.concatenate((a[R] != a[C],R[:,None],C[:,None]),axis=1)
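To see what this produces, here is a minimal self-contained sketch run on the sample rows from the question. The function name pairwise_mismatch is just an illustrative wrapper, not part of NumPy; the body is the same three lines as above.

```python
import numpy as np

def pairwise_mismatch(a):
    """Hypothetical helper: one output row per pair (i, j) with i < j."""
    # R and C are the row indexes of every pair above the diagonal.
    R, C = np.triu_indices(len(a), 1)
    # The boolean mismatch mask is promoted to integers when joined
    # with the integer index columns.
    return np.concatenate((a[R] != a[C], R[:, None], C[:, None]), axis=1)

a = np.array([[2, 1, 3, 3],
              [2, 3, 3, 4],
              [4, 1, 3, 2]])
out = pairwise_mismatch(a)
print(out)
# [[0 1 0 1 0 1]
#  [1 0 0 1 0 2]
#  [1 1 0 1 1 2]]
```

The last two columns are 'i' and 'j', so the result matches the expected output in the question. Note that a[R] and a[C] are each full-size temporaries with one row per pair, so memory grows with n*(n-1)/2.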
Approach #2: We can also make use of slicing and iteratively fill in -
a = df.values
n = a.shape[0]
N = n*(n-1)//2
idx = np.concatenate(( [0], np.arange(n-1,0,-1).cumsum() ))
start, stop = idx[:-1], idx[1:]
out = np.empty((N,a.shape[1]+2),dtype=a.dtype)
for j,i in enumerate(range(n-1)):
    s0,s1 = start[j],stop[j]
    out[s0:s1,:-2] = a[i,None] != a[i+1:]
    out[s0:s1,-2] = j
    out[s0:s1,-1] = np.arange(j+1,n)
out would be your allArrays.
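As a sanity check, the sketch below wraps both approaches in functions and confirms they give the same result on random data. The names mismatch_triu and mismatch_sliced, the random test array, and the fixed int64 output dtype are assumptions made for this illustration only.

```python
import numpy as np

def mismatch_triu(a):
    # Approach #1: index all pairs at once.
    R, C = np.triu_indices(len(a), 1)
    return np.concatenate((a[R] != a[C], R[:, None], C[:, None]), axis=1)

def mismatch_sliced(a):
    # Approach #2: fill a preallocated array block by block.
    n = a.shape[0]
    # Block i holds the n-1-i pairs (i, i+1), ..., (i, n-1);
    # idx[i] and idx[i+1] are that block's start and stop rows.
    idx = np.concatenate(([0], np.arange(n - 1, 0, -1).cumsum()))
    out = np.empty((n * (n - 1) // 2, a.shape[1] + 2), dtype=np.int64)
    for i in range(n - 1):
        s0, s1 = idx[i], idx[i + 1]
        out[s0:s1, :-2] = a[i] != a[i + 1:]
        out[s0:s1, -2] = i
        out[s0:s1, -1] = np.arange(i + 1, n)
    return out

rng = np.random.default_rng(0)
a = rng.integers(0, 5, size=(50, 4))  # hypothetical test data
same = np.array_equal(mismatch_triu(a), mismatch_sliced(a))
print(same)
```

The sliced version still loops over rows, but it writes into one preallocated array instead of concatenating on every iteration, which is likely where most of the time goes in the original loop. Which approach is faster will depend on the array size, so it is worth timing both on your own data.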
Upvotes: 1