Reputation: 399
I have two dataframes, df1 and df2:
print(df1.shape)
(1042009, 40)
print(df1.columns)
Index(['date_acte', 'transaction_id', 'amount', ...],
dtype='object')
print(df2.shape)
(734738, 37)
print(df2.columns)
Index(['date', 'transaction_id', 'amount', ...],
dtype='object')
I would like to remove from df1 every row whose transaction_id appears in df2, and keep the rest.
I did the following:
Filtre = list(df2.transaction_id.unique())
print(len(Filtre))
733465
noMatched = df1.loc[
    ~df1['transaction_id'].str.contains('|'.join(Filtre), case=False, na=False)]
My problem is that computing noMatched takes almost 5 hours. Is there a more efficient way to write this piece of code, so that the output is generated in far less than 5 hours?
Upvotes: 0
Views: 93
Reputation: 1976
I tried the fast DataFrame packages Polars and Dask alongside pandas, and report each one's average execution time over 10 runs of the same task. The results show that Polars is the fastest.
import pandas as pd
import numpy as np
import time
import polars as pl
import dask.dataframe as dd
import matplotlib.pyplot as plt
# Generate sample data as per the description (outside the loop for efficiency)
df1 = pd.DataFrame({
    'transaction_id': np.random.randint(1000000, 2000000, size=1042009),
    'amount': np.random.rand(1042009),
    'date_acte': pd.date_range('2020-01-01', periods=1042009, freq='T')
})
df2 = pd.DataFrame({
    'transaction_id': np.random.randint(1500000, 2500000, size=734738),
    'amount': np.random.rand(734738),
    'date': pd.date_range('2020-01-01', periods=734738, freq='T')
})
df1_pl = pl.DataFrame(df1)
df2_pl = pl.DataFrame(df2)
df1_dask = dd.from_pandas(df1, npartitions=10)
df2_dask = dd.from_pandas(df2, npartitions=10)
pandas_times = []
polars_times = []
dask_times = []
for _ in range(10):
    # Measure execution time for pandas
    start_time = time.time()
    filtre_set_pd = set(df2['transaction_id'])
    noMatched_pd = df1[~df1['transaction_id'].isin(filtre_set_pd)]
    end_time = time.time()
    pandas_times.append(end_time - start_time)

    # Measure execution time for Polars
    start_time = time.time()
    filtre_set_pl = df2_pl['transaction_id'].unique()
    noMatched_pl = df1_pl.filter(~df1_pl['transaction_id'].is_in(filtre_set_pl))
    end_time = time.time()
    polars_times.append(end_time - start_time)

    # Measure execution time for Dask
    start_time = time.time()
    filtre_set_dask = df2_dask['transaction_id'].unique().compute()
    noMatched_dask = df1_dask[~df1_dask['transaction_id'].isin(filtre_set_dask)].compute()
    end_time = time.time()
    dask_times.append(end_time - start_time)
print(f"Average Pandas Execution time: {np.mean(pandas_times):.2f} seconds")
print(f"Average Polars Execution time: {np.mean(polars_times):.2f} seconds")
print(f"Average Dask Execution time: {np.mean(dask_times):.2f} seconds")
# Plot execution times
plt.figure(figsize=(8, 3))
bars = plt.bar(['Pandas', 'Polars', 'Dask'], [np.mean(pandas_times), np.mean(polars_times), np.mean(dask_times)],
color=['lightcoral', 'skyblue', 'lightgreen'], width=0.3)
plt.ylabel("Average Execution Time (seconds)")
plt.title("Average Execution Time Comparison: Pandas vs Polars vs Dask (10 runs)")
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval + 0.01, round(yval, 2), ha='center', va='bottom', fontsize=10)
plt.show()
Output: (bar chart of the average execution times over 10 runs; Polars is the fastest)
Another inexpensive way to speed up pandas is the FireDucks package (Ref.). Run the following in a separate cell:
#!pip install fireducks
# import pandas as pd
import fireducks.pandas as pd
We then re-run the same code; the only change is that pd is now bound to FireDucks:
import fireducks.pandas as pd  # drop-in replacement for pandas
import numpy as np
import time
import polars as pl
import dask.dataframe as dd
import matplotlib.pyplot as plt
# ... (rest of code) ...
Output: (the same bar chart re-run with FireDucks; the pandas timings drop substantially)
With this simple trick, you can speed up pandas by roughly 20x-30x!
I hope this helps you pick a better solution. If you have computing infrastructure with more capacity for parallelization, you can also use PySpark to make it faster, as sketched below.
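A minimal PySpark sketch of the same anti-join, assuming a local SparkSession and the sample df1/df2 pandas frames from above (the session setup and names here are illustrative, not from the original benchmark):
from pyspark.sql import SparkSession

# Hypothetical local session; on a cluster you would configure master and resources
spark = SparkSession.builder.appName("anti_join_example").getOrCreate()

sdf1 = spark.createDataFrame(df1)  # convert the sample pandas frames
sdf2 = spark.createDataFrame(df2)

# A left_anti join keeps the rows of sdf1 whose transaction_id
# never appears in sdf2, the same result as the isin filters above
noMatched_spark = sdf1.join(
    sdf2.select("transaction_id").distinct(),
    on="transaction_id",
    how="left_anti",
)
print(noMatched_spark.count())
Spark shuffles both sides by transaction_id and drops the matches in parallel, so the anti-join scales out across cores or machines.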
Upvotes: 2
Reputation: 11522
You could do it this way, using isin, which tests membership with a hash lookup instead of matching a regex with 733,465 alternatives against every row:
import pandas as pd
import numpy as np
import time
df1 = pd.DataFrame({
    'transaction_id': np.random.randint(1000000, 2000000, size=1042009),
    'amount': np.random.rand(1042009),
    'date_acte': pd.date_range('2020-01-01', periods=1042009, freq='T')
})
df2 = pd.DataFrame({
    'transaction_id': np.random.randint(1500000, 2500000, size=734738),
    'amount': np.random.rand(734738),
    'date': pd.date_range('2020-01-01', periods=734738, freq='T')
})
start_time = time.time()
filtre_set = set(df2['transaction_id'].unique())
noMatched = df1[~df1['transaction_id'].isin(filtre_set)]
end_time = time.time()
print(f"Filtered DataFrame shape: {noMatched}")
print(f"Execution time: {end_time - start_time:.2f} seconds")
which returns
Filtered DataFrame: transaction_id amount date_acte
1 1231651 0.849124 2020-01-01 00:01:00
2 1443550 0.031414 2020-01-01 00:02:00
3 1164444 0.973699 2020-01-01 00:03:00
4 1371353 0.554666 2020-01-01 00:04:00
7 1072327 0.867207 2020-01-01 00:07:00
... ... ... ...
1042004 1499512 0.114861 2021-12-24 14:44:00
1042005 1255963 0.756608 2021-12-24 14:45:00
1042006 1203341 0.091380 2021-12-24 14:46:00
1042007 1016687 0.153179 2021-12-24 14:47:00
1042008 1036581 0.382781 2021-12-24 14:48:00
[770625 rows x 3 columns]
Execution time: 0.52 seconds
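For comparison, the same anti-join can be written as a left merge with indicator=True, a common pandas idiom that avoids building an explicit set; a minimal sketch with the sample frames above:
# Anti-join via merge: keep rows of df1 whose transaction_id is absent from df2
merged = df1.merge(df2[['transaction_id']].drop_duplicates(),
                   on='transaction_id', how='left', indicator=True)
noMatched = merged.loc[merged['_merge'] == 'left_only', df1.columns]
Here drop_duplicates keeps the merge from replicating df1 rows when a transaction_id occurs multiple times in df2.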
Upvotes: 2