Reputation: 49
Suppose we have a pickle file called pickle_list.pkl which contains 23 pandas data frames. Also, df_combined is a concatenation of all those data frames. Suppose that the shape of df_combined is (1000000, 5000). Is there a more efficient way of running the following block of code? Maybe some type of parallelization could work? Right now it is on row 69000 and it has been running for a day.
import pickle
import pandas as pd
df_list = pd.read_pickle(r'pickle_list.pkl')
df_combined = pd.concat(df_list, ignore_index=True)
for row in df_combined.itertuples():
    print(row.Index)
    id = row.id
    df_test = df_combined[df_combined['id'] == str(id)]
Upvotes: 0
Views: 64
Reputation: 50358
Your loop is slow because the boolean filter df_combined[df_combined['id']==str(id)] re-scans all 1,000,000 rows on every single iteration. You can use groupby to build a dictionary of sub-frames once, and then fetch the rows for any identifier with a fast dictionary lookup. Here is an untested example to show the idea:
import pickle
import pandas as pd

df_list = pd.read_pickle(r'pickle_list.pkl')
df_combined = pd.concat(df_list, ignore_index=True)

# Build the lookup table once: one sub-frame per unique id
all_groups = {ident: df for ident, df in df_combined.groupby('id')}

for row in df_combined.itertuples():
    id = row.id
    # You may need to guard this lookup in case the searched ID does not exist
    df_test = all_groups[str(id)]
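If the IDs you look up can be missing from the dictionary (for example, when they come from somewhere other than df_combined itself), a minimal sketch of a guarded version using dict.get follows; the empty-frame fallback is just one possible choice, not part of the original answer:

empty = df_combined.iloc[0:0]  # empty frame with the same columns as df_combined
for row in df_combined.itertuples():
    # dict.get returns the fallback instead of raising KeyError on a missing ID
    df_test = all_groups.get(str(row.id), empty)
    if df_test.empty:
        continue  # no rows for this ID, move on

Either way, each lookup is a single dictionary access instead of a full scan of the million-row frame, which is where the speedup comes from.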
Upvotes: 1