bravopapa

Reputation: 399

Removing a large number of IDs from a large dataframe takes a long time

I have two dataframes df1 and df2

print(df1.shape)
(1042009, 40)

print(df1.columns)
Index(['date_acte', 'transaction_id', 'amount', ...],
      dtype='object')

print(df2.shape)
(734738, 37)

print(df2.columns)
Index(['date', 'transaction_id', 'amount', ...],
      dtype='object')

I would like to remove from df1 every row whose transaction_id appears in df2, and keep the rest.

I did the following:

Filtre = list(df2.transaction_id.unique())
print(len(Filtre))
733465

noMatched = df1.loc[
    (~df1['transaction_id'].str.contains('|'.join(Filtre), case=False, na=False))]

My problem is that computing noMatched takes almost 5 hours. Is there a more efficient way to write this piece of code, so the output can be generated in much less time?

Upvotes: 0

Views: 93

Answers (2)

Mario

Reputation: 1976

I have tried other fast packages (Polars and Dask) alongside pandas and recorded their average execution time over the same task for 10 runs. The benchmark below shows which one comes out fastest.

import pandas as pd
import numpy as np
import time
import polars as pl
import dask.dataframe as dd
import matplotlib.pyplot as plt

# Generate sample data as per the description (outside the loop for efficiency)
df1 = pd.DataFrame({
    'transaction_id': np.random.randint(1000000, 2000000, size=1042009),
    'amount': np.random.rand(1042009),
    'date_acte': pd.date_range('2020-01-01', periods=1042009, freq='T')
})

df2 = pd.DataFrame({
    'transaction_id': np.random.randint(1500000, 2500000, size=734738),
    'amount': np.random.rand(734738),
    'date': pd.date_range('2020-01-01', periods=734738, freq='T')
})

df1_pl = pl.DataFrame(df1)
df2_pl = pl.DataFrame(df2)

df1_dask = dd.from_pandas(df1, npartitions=10)
df2_dask = dd.from_pandas(df2, npartitions=10)


pandas_times = []
polars_times = []
dask_times = []

for _ in range(10):
    # Measure execution time for Pandas
    start_time = time.time()
    filtre_set_pd = set(df2['transaction_id'])
    noMatched_pd = df1[~df1['transaction_id'].isin(filtre_set_pd)]
    end_time = time.time()
    pandas_times.append(end_time - start_time)

    # Measure execution time for Polars
    start_time = time.time()
    filtre_set_pl = df2_pl['transaction_id'].unique()
    noMatched_pl = df1_pl.filter(~df1_pl['transaction_id'].is_in(filtre_set_pl))
    end_time = time.time()
    polars_times.append(end_time - start_time)
    
    # Measure execution time for Dask
    start_time = time.time()
    filtre_set_dask = df2_dask['transaction_id'].unique().compute()
    noMatched_dask = df1_dask[~df1_dask['transaction_id'].isin(filtre_set_dask)].compute()
    end_time = time.time()
    dask_times.append(end_time - start_time)

print(f"Average Pandas Execution time: {np.mean(pandas_times):.2f} seconds")
print(f"Average Polars Execution time: {np.mean(polars_times):.2f} seconds")
print(f"Average Dask Execution time:   {np.mean(dask_times):.2f} seconds")

# Plot execution times
plt.figure(figsize=(8, 3))
bars = plt.bar(['Pandas', 'Polars', 'Dask'], [np.mean(pandas_times), np.mean(polars_times), np.mean(dask_times)],
                color=['lightcoral', 'skyblue', 'lightgreen'], width=0.3)
plt.ylabel("Average Execution Time (seconds)")
plt.title("Average Execution Time Comparison: Pandas vs Polars vs Dask (10 runs)")

for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval + 0.01, round(yval, 2), ha='center', va='bottom', fontsize=10)

plt.show()

output:

[bar chart: average execution time over 10 runs for Pandas, Polars and Dask]


Another inexpensive way to speed things up is to use the fireducks package (Ref.).

Run the following in a separate cell:

# !pip install fireducks
# import pandas as pd           # replace the plain pandas import with:
import fireducks.pandas as pd   # drop-in, pandas-compatible API

Now we re-run the same code:

import fireducks.pandas as pd  # use the FireDucks drop-in instead of plain pandas
import numpy as np
import time
import polars as pl
import dask.dataframe as dd
import matplotlib.pyplot as plt

# ... (rest of code) ...

output:

[bar chart: average execution times for the same benchmark re-run with fireducks]

With this simple trick, you can get a 20x-30x speed-up!

I hope this helps you pick a suitable solution. If you have computing infrastructure that allows for more parallelization, you can also use PySpark to make it faster, as sketched below.
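If you go the PySpark route, the same filter can be expressed as a left anti-join. A minimal sketch, assuming a local SparkSession and the df1/df2 pandas dataframes from the benchmark above (the names sdf1, sdf2 and noMatched_spark are just illustrative):

from pyspark.sql import SparkSession

# Local Spark session for the sketch; point master/config at a real cluster in practice.
spark = SparkSession.builder.master("local[*]").appName("anti_join_filter").getOrCreate()

# Convert the existing pandas dataframes to Spark dataframes.
sdf1 = spark.createDataFrame(df1)
sdf2 = spark.createDataFrame(df2)

# A left anti-join keeps only the rows of sdf1 whose transaction_id does NOT appear in sdf2.
noMatched_spark = sdf1.join(
    sdf2.select("transaction_id").distinct(),
    on="transaction_id",
    how="left_anti",
)

print(noMatched_spark.count())

On a single machine this is unlikely to beat the vectorized isin, but on a cluster the anti-join scales out across partitions.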

Upvotes: 2

You could do it this way:

import pandas as pd
import numpy as np
import time

df1 = pd.DataFrame({
    'transaction_id': np.random.randint(1000000, 2000000, size=1042009),
    'amount': np.random.rand(1042009),
    'date_acte': pd.date_range('2020-01-01', periods=1042009, freq='T')
})

df2 = pd.DataFrame({
    'transaction_id': np.random.randint(1500000, 2500000, size=734738),
    'amount': np.random.rand(734738),
    'date': pd.date_range('2020-01-01', periods=734738, freq='T')
})

start_time = time.time()

filtre_set = set(df2['transaction_id'].unique())

noMatched = df1[~df1['transaction_id'].isin(filtre_set)]

end_time = time.time()

print(f"Filtered DataFrame shape: {noMatched}")
print(f"Execution time: {end_time - start_time:.2f} seconds")

which returns

Filtered DataFrame shape:          transaction_id    amount           date_acte
1               1231651  0.849124 2020-01-01 00:01:00
2               1443550  0.031414 2020-01-01 00:02:00
3               1164444  0.973699 2020-01-01 00:03:00
4               1371353  0.554666 2020-01-01 00:04:00
7               1072327  0.867207 2020-01-01 00:07:00
...                 ...       ...                 ...
1042004         1499512  0.114861 2021-12-24 14:44:00
1042005         1255963  0.756608 2021-12-24 14:45:00
1042006         1203341  0.091380 2021-12-24 14:46:00
1042007         1016687  0.153179 2021-12-24 14:47:00
1042008         1036581  0.382781 2021-12-24 14:48:00

[770625 rows x 3 columns]
Execution time: 0.52 seconds
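If you prefer to stay in plain pandas but want the same semantics without building an intermediate set, an indicator merge (anti-join) is another option. A minimal sketch using the df1/df2 defined above (noMatched_merge is just an illustrative name):

# Anti-join via merge: keep rows of df1 whose transaction_id has no match in df2.
noMatched_merge = (
    df1.merge(
        df2[['transaction_id']].drop_duplicates(),
        on='transaction_id',
        how='left',
        indicator=True,
    )
    .query("_merge == 'left_only'")
    .drop(columns='_merge')
)

print(noMatched_merge.shape)

Note that merge produces a fresh RangeIndex, whereas the isin filter above preserves df1's original row index.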

Upvotes: 2
