bravopapa

Reputation: 399

Removing a large number of IDs from a large dataframe takes a long time

I have two dataframes df1 and df2

print(df1.shape)
(1042009, 40)

print(df1.columns)
Index(['date_acte', 'transaction_id', 'amount', ...],
      dtype='object')

print(df2.shape)
(734738, 37)

print(df2.columns)
Index(['date', 'transaction_id', 'amount', ...],
      dtype='object')

I would like to remove from df1 every row whose transaction_id appears in df2, and keep the rest.

I did the following:

Filtre = list(df2.transaction_id.unique())
print(len(Filtre))
733465

noMatched = df1.loc[
    (~df1['transaction_id'].str.contains('|'.join(Filtre), case=False, na=False))]

My problem is that computing noMatched takes almost 5 hours. Is there a more efficient way to write this piece of code, so the output can be generated in much less time?

Upvotes: 0

Views: 93

Answers (2)

Mario

Reputation: 1976

I have tried other fast packages (Polars and Dask) alongside pandas and recorded their average execution time over the same task for 10 runs. The benchmark below shows which one comes out fastest.

import pandas as pd
import numpy as np
import time
import polars as pl
import dask.dataframe as dd
import matplotlib.pyplot as plt

# Generate sample data as per the description (outside the loop for efficiency)
df1 = pd.DataFrame({
    'transaction_id': np.random.randint(1000000, 2000000, size=1042009),
    'amount': np.random.rand(1042009),
    'date_acte': pd.date_range('2020-01-01', periods=1042009, freq='T')
})

df2 = pd.DataFrame({
    'transaction_id': np.random.randint(1500000, 2500000, size=734738),
    'amount': np.random.rand(734738),
    'date': pd.date_range('2020-01-01', periods=734738, freq='T')
})

df1_pl = pl.DataFrame(df1)
df2_pl = pl.DataFrame(df2)

df1_dask = dd.from_pandas(df1, npartitions=10)
df2_dask = dd.from_pandas(df2, npartitions=10)


pandas_times = []
polars_times = []
dask_times = []

for _ in range(10):
    # Measure execution time for Pandas
    start_time = time.time()
    filtre_set_pd = set(df2['transaction_id'])
    noMatched_pd = df1[~df1['transaction_id'].isin(filtre_set_pd)]
    end_time = time.time()
    pandas_times.append(end_time - start_time)

    # Measure execution time for Polars
    start_time = time.time()
    filtre_set_pl = df2_pl['transaction_id'].unique()
    noMatched_pl = df1_pl.filter(~df1_pl['transaction_id'].is_in(filtre_set_pl))
    end_time = time.time()
    polars_times.append(end_time - start_time)
    
    # Measure execution time for Dask
    start_time = time.time()
    filtre_set_dask = df2_dask['transaction_id'].unique().compute()
    noMatched_dask = df1_dask[~df1_dask['transaction_id'].isin(filtre_set_dask)].compute()
    end_time = time.time()
    dask_times.append(end_time - start_time)

print(f"Average Pandas Execution time: {np.mean(pandas_times):.2f} seconds")
print(f"Average Polars Execution time: {np.mean(polars_times):.2f} seconds")
print(f"Average Dask Execution time:   {np.mean(dask_times):.2f} seconds")

# Plot execution times
plt.figure(figsize=(8, 3))
bars = plt.bar(['Pandas', 'Polars', 'Dask'], [np.mean(pandas_times), np.mean(polars_times), np.mean(dask_times)],
                color=['lightcoral', 'skyblue', 'lightgreen'], width=0.3)
plt.ylabel("Average Execution Time (seconds)")
plt.title("Average Execution Time Comparison: Pandas vs Polars vs Dask (10 runs)")

for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2, yval + 0.01, round(yval, 2), ha='center', va='bottom', fontsize=10)

plt.show()

output:

[bar chart: average execution time over 10 runs for Pandas, Polars and Dask]


Another inexpensive way to speed things up is to use the fireducks package (Ref.).

Run the following in a separate cell:

# !pip install fireducks
# import pandas as pd           # replace the plain pandas import with:
import fireducks.pandas as pd   # drop-in, pandas-compatible API

Now we re-run the same code:

import fireducks.pandas as pd  # use the FireDucks drop-in instead of plain pandas
import numpy as np
import time
import polars as pl
import dask.dataframe as dd
import matplotlib.pyplot as plt

# ... (rest of code) ...

output:

[bar chart: average execution times for the same benchmark re-run with fireducks]

With this simple trick, you can get a 20x-30x speed-up!

I hope this helps you pick a suitable solution. If you have computing infrastructure that allows for more parallelization, you can also use PySpark to make it faster, as sketched below.
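If you go the PySpark route, the same filter can be expressed as a left anti-join. A minimal sketch, assuming a local SparkSession and the df1/df2 pandas dataframes from the benchmark above (the names sdf1, sdf2 and noMatched_spark are just illustrative):

from pyspark.sql import SparkSession

# Local Spark session for the sketch; point master/config at a real cluster in practice.
spark = SparkSession.builder.master("local[*]").appName("anti_join_filter").getOrCreate()

# Convert the existing pandas dataframes to Spark dataframes.
sdf1 = spark.createDataFrame(df1)
sdf2 = spark.createDataFrame(df2)

# A left anti-join keeps only the rows of sdf1 whose transaction_id does NOT appear in sdf2.
noMatched_spark = sdf1.join(
    sdf2.select("transaction_id").distinct(),
    on="transaction_id",
    how="left_anti",
)

print(noMatched_spark.count())

On a single machine this is unlikely to beat the vectorized isin, but on a cluster the anti-join scales out across partitions.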

Upvotes: 2

You could do it this way:

import pandas as pd
import numpy as np
import time

df1 = pd.DataFrame({
    'transaction_id': np.random.randint(1000000, 2000000, size=1042009),
    'amount': np.random.rand(1042009),
    'date_acte': pd.date_range('2020-01-01', periods=1042009, freq='T')
})

df2 = pd.DataFrame({
    'transaction_id': np.random.randint(1500000, 2500000, size=734738),
    'amount': np.random.rand(734738),
    'date': pd.date_range('2020-01-01', periods=734738, freq='T')
})

start_time = time.time()

filtre_set = set(df2['transaction_id'].unique())

noMatched = df1[~df1['transaction_id'].isin(filtre_set)]

end_time = time.time()

print(f"Filtered DataFrame shape: {noMatched}")
print(f"Execution time: {end_time - start_time:.2f} seconds")

which returns

Filtered DataFrame shape:          transaction_id    amount           date_acte
1               1231651  0.849124 2020-01-01 00:01:00
2               1443550  0.031414 2020-01-01 00:02:00
3               1164444  0.973699 2020-01-01 00:03:00
4               1371353  0.554666 2020-01-01 00:04:00
7               1072327  0.867207 2020-01-01 00:07:00
...                 ...       ...                 ...
1042004         1499512  0.114861 2021-12-24 14:44:00
1042005         1255963  0.756608 2021-12-24 14:45:00
1042006         1203341  0.091380 2021-12-24 14:46:00
1042007         1016687  0.153179 2021-12-24 14:47:00
1042008         1036581  0.382781 2021-12-24 14:48:00

[770625 rows x 3 columns]
Execution time: 0.52 seconds
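If you prefer to stay in plain pandas but want the same semantics without building an intermediate set, an indicator merge (anti-join) is another option. A minimal sketch using the df1/df2 defined above (noMatched_merge is just an illustrative name):

# Anti-join via merge: keep rows of df1 whose transaction_id has no match in df2.
noMatched_merge = (
    df1.merge(
        df2[['transaction_id']].drop_duplicates(),
        on='transaction_id',
        how='left',
        indicator=True,
    )
    .query("_merge == 'left_only'")
    .drop(columns='_merge')
)

print(noMatched_merge.shape)

Note that merge produces a fresh RangeIndex, whereas the isin filter above preserves df1's original row index.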

Upvotes: 2
