Reputation: 199
I already asked a variation of this question, but I still have a problem regarding the runtime of my code.
Given a NumPy array consisting of 15000 rows and 44 columns, my goal is to find out which rows are equal and add their indices to a list, like this:
1 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
1 0 0 0 0
1 2 3 4 5
Result:
equal_rows1 = [1,2,3]
equal_rows2 = [0,4]
So far, I have used the following code:
import numpy as np

input_data = np.load('IN.npy')
equal_inputs1 = []
equal_inputs2 = []

for i in range(len(input_data)):
    for j in range(i + 1, len(input_data)):
        if np.array_equal(input_data[i], input_data[j]):
            equal_inputs1.append(i)
            equal_inputs2.append(j)
The problem is that this takes a long time to return the desired lists, and that it only allows for two distinct groups of equal rows, although there can be more. Is there a better solution, especially regarding the runtime?
Upvotes: 2
Views: 252
Reputation: 164843
You can use collections.defaultdict, which retains the row values as keys:
from collections import defaultdict

dd = defaultdict(list)
# Map each row (as a hashable tuple) to the list of indices where it occurs.
# With the question's NumPy array, you can enumerate input_data directly
# instead of df.values.
for idx, row in enumerate(df.values):
    dd[tuple(row)].append(idx)
print(list(dd.values()))
# [[0, 4], [1, 2, 3], [5]]
print(dd)
# defaultdict(<class 'list'>, {(1, 0, 0, 0, 0): [0, 4],
# (0, 0, 0, 0, 0): [1, 2, 3],
# (1, 2, 3, 4, 5): [5]})
You can, if you wish, filter out unique rows via a dictionary comprehension.
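For instance, a minimal sketch using the dd built above, keeping only the groups that actually contain duplicates:
# Keep only keys whose index list has more than one entry,
# i.e. rows that occur more than once.
duplicates = {k: v for k, v in dd.items() if len(v) > 1}
print(duplicates)
# {(1, 0, 0, 0, 0): [0, 4], (0, 0, 0, 0, 0): [1, 2, 3]}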
Upvotes: 1
Reputation: 403218
This is pretty simple with pandas groupby:
df
A B C D E
0 1 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 1 0 0 0 0
5 1 2 3 4 5
# Group by all columns; keep the row indices of any group with more than one member.
[g.index.tolist() for _, g in df.groupby(df.columns.tolist()) if len(g.index) > 1]
# [[1, 2, 3], [0, 4]]
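Note that groupby sorts the group keys by default, which is why [1, 2, 3] comes before [0, 4] here; if you want the groups in order of first appearance, pass sort=False:
[g.index.tolist() for _, g in df.groupby(df.columns.tolist(), sort=False) if len(g.index) > 1]
# [[0, 4], [1, 2, 3]]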
If you are dealing with many rows and many unique groups, this might get a bit slow. The performance depends on your data. Perhaps there is a faster NumPy alternative, but this is certainly the easiest to understand.
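As a minimal sketch of that NumPy route (untested on the full 15000 x 44 input, and assuming a NumPy version that supports the axis and return_inverse arguments of np.unique):
import numpy as np

arr = np.load('IN.npy')

# Label each row with the index of its unique value; rows that share
# a label are identical.
_, inverse = np.unique(arr, axis=0, return_inverse=True)

# Collect row indices per label and keep only the groups with duplicates.
# (ravel() guards against inverse shape differences across NumPy versions.)
groups = {}
for idx, label in enumerate(inverse.ravel()):
    groups.setdefault(int(label), []).append(idx)
duplicates = [g for g in groups.values() if len(g) > 1]
print(duplicates)
# [[0, 4], [1, 2, 3]] for the sample data above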
Upvotes: 1