Reputation: 49
Suppose we have a pickle file called pickle_list.pkl which contains 23 pandas data frames. Also, df_combined is a concatenation of all those data frames. Suppose that the shape of df_combined is (1000000, 5000). Is there a more efficient way of running the following block of code? Maybe some type of parallelization could work? Right now it is on row 69000 and it has been running for a day.
import pickle
import pandas as pd
df_list = pd.read_pickle(r'pickle_list.pkl')
df_combined = pd.concat(df_list, ignore_index=True)
for row in df_combined.itertuples():
    print(row.Index)
    id = row.id
    df_test = df_combined[df_combined['id'] == str(id)]
Upvotes: 0
Views: 64
Reputation: 50358
Your loop is slow because the boolean filter df_combined[df_combined['id']==str(id)] re-scans all 1,000,000 rows on every single iteration. You can use groupby to build a dictionary of sub-frames once, and then fetch the rows for any identifier with a fast dictionary lookup. Here is an untested example to show the idea:
import pickle
import pandas as pd

df_list = pd.read_pickle(r'pickle_list.pkl')
df_combined = pd.concat(df_list, ignore_index=True)

# Build the lookup table once: one sub-frame per unique id
all_groups = {ident: df for ident, df in df_combined.groupby('id')}

for row in df_combined.itertuples():
    id = row.id
    # You may need to guard this lookup in case the searched ID does not exist
    df_test = all_groups[str(id)]
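If the IDs you look up can be missing from the dictionary (for example, when they come from somewhere other than df_combined itself), a minimal sketch of a guarded version using dict.get follows; the empty-frame fallback is just one possible choice, not part of the original answer:

empty = df_combined.iloc[0:0]  # empty frame with the same columns as df_combined
for row in df_combined.itertuples():
    # dict.get returns the fallback instead of raising KeyError on a missing ID
    df_test = all_groups.get(str(row.id), empty)
    if df_test.empty:
        continue  # no rows for this ID, move on

Either way, each lookup is a single dictionary access instead of a full scan of the million-row frame, which is where the speedup comes from.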
Upvotes: 1