Alejandro

Reputation: 949

Most efficient way to clean and reorder two arrays based on their matching selected columns

Say we have array1 and array2, both two-dimensional, which may contain non-unique rows and may have different numbers of rows.

My final goal is to have cleaned versions of the two arrays with the same shape, ordered such that for each row index the values in columns 2, 3, and 4 are the same in both arrays.

Below I describe a possible sequence of steps to achieve this goal. I am wondering what the most efficient way to do this in numpy is.

1. If there are rows in array1 with identical values in columns 2, 3, 4, remove the duplicates so that only one of them is kept.

2. If there are rows in array2 with identical values in columns 2, 3, 4, remove the duplicates so that only one of them is kept.

So, based on those columns, both arrays will have unique rows.

3. Then I want to remove the rows in each array that do not have a matching row in the other array in terms of columns 2, 3, 4.

So both arrays should have the same length now.

4. Then I want to reorder array1 so that at each index array2 has the same values in columns 2, 3, 4.

-------------edit: numerical example:

import numpy as np

array1 = np.array([[1, 4, 3, 64356, 5435, 434],
                   [11, 46, 3, 7356, 585, 74],
                   [51, 406, 3, 769, 5435, 24],
                   [12, 45, 5, 656, 135, 134],
                   [112, 475, 5, 656, 1385, 134],
                   [13, 46, 5, 656, 1385, 19]])

array2 = np.array([[15, 44, 5, 656, 1385, 434],
                   [165, 644, 5, 656, 1385, 48],
                   [151, 436, 3, 356, 285, 74],
                   [521, 406, 5, 656, 135, 24],
                   [152, 445, 54, 56, 635, 134],
                   [1812, 757, 542, 546, 185, 1834],
                   [72, 77, 142, 66, 65, 64],
                   [72, 727, 12, 16, 55, 634]])

# desired result
array1_final = np.array([[112, 475, 5, 656, 1385, 134],
                         [12, 45, 5, 656, 135, 134]])

array2_final = np.array([[15, 44, 5, 656, 1385, 434],
                         [521, 406, 5, 656, 135, 24]])

Although array2[0] and array2[1] both match array1[4] on columns 2, 3, 4, only one of them is kept in the final array2. Similarly, array1[5] was dropped because it duplicates array1[4] on those columns. The final arrays are in the same order with respect to their matching columns 2, 3, 4. The remaining rows are dropped because they have no matching counterpart in the other array on those columns.

Upvotes: 0

Views: 95

Answers (1)

MBeale

Reputation: 750

I have an answer, although admittedly there may be a better one out there.

import numpy as np

# find the unique rows (by columns 2, 3, 4) and the index of the first occurrence of each
array1_v, array1_i = np.unique(array1[:, [2, 3, 4]], axis=0, return_index=True)
array2_v, array2_i = np.unique(array2[:, [2, 3, 4]], axis=0, return_index=True)

# find if the unique rows exist in the other array
array1_in_array2 = [row.tolist() in array2_v.tolist() for row in array1_v]
array2_in_array1 = [row.tolist() in array1_v.tolist() for row in array2_v]

# final results (np.unique returns sorted rows, so both results come out in the same key order)
array1_final = array1[array1_i[array1_in_array2]]
array2_final = array2[array2_i[array2_in_array1]]
>>> array1_final
array([[  12,   45,    5,  656,  135,  134],
       [ 112,  475,    5,  656, 1385,  134]])
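
If the list-based membership test becomes the bottleneck on large arrays, one possible variation keeps that step inside numpy too. This is only a sketch and I have not benchmarked it. The helper name `match_on_columns` and its `cols` parameter are made up for illustration. It relies on each key appearing at most once per array after the first `np.unique`, so a key shared by both arrays appears exactly twice once the two key sets are stacked.

```python
import numpy as np

def match_on_columns(a, b, cols=(2, 3, 4)):
    """Hypothetical helper: dedupe a and b on `cols`, keep shared keys, align rows."""
    cols = list(cols)
    # Unique key rows (sorted lexicographically) and the first index of each key.
    keys_a, idx_a = np.unique(a[:, cols], axis=0, return_index=True)
    keys_b, idx_b = np.unique(b[:, cols], axis=0, return_index=True)

    # Each key occurs at most once per side, so a key present on both sides
    # occurs exactly twice in the stacked key array.
    stacked = np.concatenate([keys_a, keys_b])
    _, inverse, counts = np.unique(
        stacked, axis=0, return_inverse=True, return_counts=True
    )
    shared = counts[inverse.ravel()] == 2

    in_b = shared[:len(keys_a)]   # which keys of a also occur in b
    in_a = shared[len(keys_a):]   # which keys of b also occur in a
    # Both key sets are sorted the same way, so the selected rows line up.
    return a[idx_a[in_b]], b[idx_b[in_a]]


array1 = np.array([[1, 4, 3, 64356, 5435, 434],
                   [11, 46, 3, 7356, 585, 74],
                   [51, 406, 3, 769, 5435, 24],
                   [12, 45, 5, 656, 135, 134],
                   [112, 475, 5, 656, 1385, 134],
                   [13, 46, 5, 656, 1385, 19]])

array2 = np.array([[15, 44, 5, 656, 1385, 434],
                   [165, 644, 5, 656, 1385, 48],
                   [151, 436, 3, 356, 285, 74],
                   [521, 406, 5, 656, 135, 24],
                   [152, 445, 54, 56, 635, 134],
                   [1812, 757, 542, 546, 185, 1834],
                   [72, 77, 142, 66, 65, 64],
                   [72, 727, 12, 16, 55, 634]])

array1_final, array2_final = match_on_columns(array1, array2)
```

As with the version above, the rows come back sorted by the key columns rather than in their original order, but both outputs use the same order, so row i of one matches row i of the other on columns 2, 3, 4.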

Upvotes: 1
