Reputation: 19
Good afternoon,
I'm iterating through a huge DataFrame (104062 x 20) with the following code:
import pandas as pd

df_tot = pd.read_csv("C:\\Users\\XXXXX\\Desktop\\XXXXXXX\\LOGS\\DF_TOT.txt", header=None)

# Strip the brackets and quotes left over from the logged lists
# (raw strings avoid invalid-escape warnings in the regex patterns).
df_tot = df_tot.replace(r"\[", "", regex=True)
df_tot = df_tot.replace(r"\]", "", regex=True)
df_tot = df_tot.replace(r"\'", "", regex=True)

# Compare every row against every other row.
i = 0
while i < len(df_tot):
    to_compare = df_tot.iloc[i].tolist()
    for j in range(len(df_tot)):
        if to_compare == df_tot.iloc[j].tolist():
            if i == j:
                print('Matched itself.')
            else:
                print('MATCH FOUND - row: {} --- match row: {}'.format(i, j))
    i += 1
I am looking to optimize the time spent on each iteration as much as possible, since this code performs 104062^2 comparisons (more or less ten billion iterations).
With my computing power, comparing to_compare against the whole DataFrame takes around 26 seconds.
I want to clarify that, if needed, the whole code could be rewritten with faster constructs.
As usual, thanks in advance.
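For reference, here is a rough sketch of the kind of faster construct I mean (not benchmarked; it assumes the cleaned df_tot from the code above): build a dict keyed on row tuples in a single pass, so each row is looked up once instead of being compared against every other row.
from collections import defaultdict

# Map each row (as a tuple of its values) to the list of indexes where it occurs.
groups = defaultdict(list)
for idx, row in enumerate(df_tot.itertuples(index=False, name=None)):
    groups[row].append(idx)

# Any row appearing under more than one index is a duplicate.
for row_values, indexes in groups.items():
    if len(indexes) > 1:
        print('MATCH FOUND - rows: {}'.format(indexes))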
Upvotes: 0
Views: 58
Reputation: 136
As far as I understand, you just want to find duplicated rows.
Sample data (the last 2 rows are duplicates):
In [1]: df = pd.DataFrame([[1,2], [3,4], [5,6], [7,8], [1,2], [5,6]], columns=['a', 'b'])
df
Out[1]:
a b
0 1 2
1 3 4
2 5 6
3 7 8
4 1 2
5 5 6
This will return all duplicated rows:
In [2]: df[df.duplicated(keep=False)]
Out[2]:
a b
0 1 2
2 5 6
4 1 2
5 5 6
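As a side note (a sketch, not part of the original answer): duplicated() also takes keep='first' (the default) or keep='last' if you only want the extra occurrences rather than every member of a duplicated group:
# With keep='first', only the later occurrences are flagged,
# so on the sample df this returns rows 4 and 5.
df[df.duplicated(keep='first')]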
And the indexes, grouped by duplicated row:
In [3]: df[df.duplicated(keep=False)].reset_index().groupby(list(df.columns), as_index=False)['index'].apply(list)
Out[3]: a b
1 2 [0, 4]
5 6 [2, 5]
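A rough equivalent of the line above (a sketch, not taken from the answer): groupby over all columns exposes a .groups mapping from row values to indexes, which can be filtered down to the groups with more than one member:
# Gives {(1, 2): [0, 4], (5, 6): [2, 5]} for the sample df.
dup_groups = df.groupby(list(df.columns)).groups
{values: list(idx) for values, idx in dup_groups.items() if len(idx) > 1}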
You can also just remove the duplicates from the DataFrame:
In [4]: df.drop_duplicates()
Out[4]:
a b
0 1 2
1 3 4
2 5 6
3 7 8
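drop_duplicates() also accepts the same keep argument, plus a subset parameter if only some columns should define a duplicate; a minimal sketch:
# Keep the last occurrence instead of the first, and decide duplicates
# by column 'a' only.
df.drop_duplicates(subset=['a'], keep='last')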
Upvotes: 1