Reputation: 1212
I am looking for a way to rewrite a pandas groupby to improve performance, as the current version takes ages on the dataset of interest:
def to_df_with_lists(group, gby):
    ret_df = pd.DataFrame(columns=group.drop(gby, axis=1).columns, index=[0])
    for col in group.drop(gby, axis=1).columns:
        ret_df.loc[0, col] = list(group[col].values)
        if len(ret_df.loc[0, col]) == 1:
            ret_df.loc[0, col] = ret_df.loc[0, col][0]
    return ret_df
Basically, for a given group, it collects each column's values into a list. I cannot keep multiple rows because I merge the result with other DataFrames in a similar format, and the lists all end up with different lengths (later I convert it to another format).
From this:
I want to get this (note that if a list would have length 1, the function returns the single value rather than a one-element list):
I know this is not really the best or most common way to work with DataFrames, but I haven't found a format that lets me do what I want.
Example DataFrame:
import pandas as pd
df_sub = pd.DataFrame({'director_id': [9970, 9970, 9970, 9970, 9970],
                       'genre': ['Adventure', 'Comedy', 'Crime', 'Drama', 'Romance'],
                       'prob': [0.041667, 0.083333, 0.166667, 0.833333, 0.083333]},
                      index=[17317, 17318, 17319, 17320, 17321])
group = df_sub.groupby('director_id').get_group(9970)
Upvotes: 1
Views: 296
Reputation: 4864
Except for making singleton lists atomic (which strikes me as a bad idea), the following works:
df_sub.groupby('director_id').agg(lambda x: list(x))
Whether it is much faster than your code, I cannot say (the example is too small).
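If you do need the singleton behaviour, one sketch (my own, not tested on your full dataset) is to keep the vectorized `agg(list)` and unwrap length-1 lists in a second pass with `applymap` (renamed `DataFrame.map` in pandas >= 2.1). Group 2 below is a made-up single-row group just to show the unwrapping:

```python
import pandas as pd

df = pd.DataFrame({'director_id': [1, 1, 2],
                   'genre': ['Adventure', 'Drama', 'Comedy'],
                   'prob': [0.1, 0.9, 0.5]})

# Collect every non-key column into a list per group (fast, vectorized).
out = df.groupby('director_id').agg(list)

# Unwrap length-1 lists back to scalars, mirroring the question's rule.
out = out.applymap(lambda v: v[0] if len(v) == 1 else v)

print(out.loc[1, 'genre'])  # ['Adventure', 'Drama']
print(out.loc[2, 'genre'])  # 'Comedy'
```

The per-cell `applymap` pass is cheap because it runs on the already-aggregated frame (one row per group), not on the raw data.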
Upvotes: 1