Valeria

Reputation: 1212

Faster groupby in pandas: list of values

I am looking for a way to rewrite a pandas groupby to improve performance, as the current version takes far too long on the dataset of interest:

def to_df_with_lists(group, gby):
    # one-row frame with the group's columns, minus the groupby key
    ret_df = pd.DataFrame(columns=group.drop(gby, axis=1).columns, index=[0])
    for col in group.drop(gby, axis=1).columns:
        # store the column's values as a list in a single cell
        ret_df.loc[0, col] = list(group[col].values)
        # collapse a one-element list to the bare value
        if len(ret_df.loc[0, col]) == 1:
            ret_df.loc[0, col] = ret_df.loc[0, col][0]
    return ret_df

Basically, for a given group, it collects each column's values into a list. I cannot keep multiple rows because I merge the result with other DataFrames in a similar format, and the lengths of the lists all differ (later I convert it to another format).

From this:

       director_id      genre      prob
17317         9970  Adventure  0.041667
17318         9970     Comedy  0.083333
17319         9970      Crime  0.166667
17320         9970      Drama  0.833333
17321         9970    Romance  0.083333

I want to get this (note that if a list has length 1, the function stores the bare value instead of a one-element list):

        genre                                       prob
0  [Adventure, Comedy, Crime, Drama, Romance]  [0.041667, 0.083333, 0.166667, 0.833333, 0.083333]

I know this is not really the best or most common way to work with DataFrames, but I haven't found a format that lets me do what I want.

Example DataFrame:

import pandas as pd
df_sub = pd.DataFrame({'director_id': [9970, 9970, 9970, 9970, 9970], 
                       'genre': ['Adventure', 'Comedy', 'Crime', 'Drama', 'Romance'],
                       'prob': [0.041667, 0.083333, 0.166667, 0.833333, 0.083333]},
                      index=[17317, 17318, 17319, 17320, 17321])
group = df_sub.groupby('director_id').get_group(9970)
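For concreteness, the target shape for this one group can be written out by hand (a sketch of the desired output only, not a general solution):

```python
import pandas as pd

# the example data from above
df_sub = pd.DataFrame({'director_id': [9970, 9970, 9970, 9970, 9970],
                       'genre': ['Adventure', 'Comedy', 'Crime', 'Drama', 'Romance'],
                       'prob': [0.041667, 0.083333, 0.166667, 0.833333, 0.083333]},
                      index=[17317, 17318, 17319, 17320, 17321])
group = df_sub.groupby('director_id').get_group(9970)

# the desired single-row frame: each cell holds the column's values as a list
desired = pd.DataFrame({'genre': [list(group['genre'])],
                        'prob': [list(group['prob'])]})
print(desired)
```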

Upvotes: 1

Views: 296

Answers (1)

Igor Rivin

Reputation: 4864

Except for making singleton lists atomic (which strikes me as a bad idea), the following works:

df_sub.groupby('director_id').agg(lambda x: list(x))

Whether it is much faster than your code, I cannot say (the example is too small).
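For completeness, here is a sketch that also collapses the length-1 lists back to scalars afterwards, using a plain Series.map pass over each column (whether that extra pass is worth keeping is your call):

```python
import pandas as pd

df_sub = pd.DataFrame({'director_id': [9970, 9970, 9970, 9970, 9970],
                       'genre': ['Adventure', 'Comedy', 'Crime', 'Drama', 'Romance'],
                       'prob': [0.041667, 0.083333, 0.166667, 0.833333, 0.083333]},
                      index=[17317, 17318, 17319, 17320, 17321])

# one vectorized groupby.agg instead of a Python loop over groups
out = df_sub.groupby('director_id').agg(lambda x: list(x))

# collapse length-1 lists to bare values, as the question requires
out = out.apply(lambda col: col.map(lambda v: v[0] if len(v) == 1 else v))
print(out)
```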

Upvotes: 1
