Marc Massar

Reputation: 57

Optimizing pandas groupby on many small groups

I have a pandas DataFrame with many small groups:

In [84]: n=10000

In [85]: df=pd.DataFrame({'group':sorted(list(range(n))*4),'val':np.random.randint(6,size=4*n)}).sort_values(['group','val']).reset_index(drop=True)

In [86]: df.head(9)
Out[86]: 
   group  val
0      0    0
1      0    0
2      0    1
3      0    2
4      1    1
5      1    2
6      1    2
7      1    4
8      2    0

I want to do something special for groups where both val==1 and val==0 appear: e.g. replace each 1 in the group with 99, but only if a val==0 is also present in that group.
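To make the rule concrete, here is a minimal before/after sketch on a toy frame (the frame and the loop are purely illustrative):

import pandas as pd

toy = pd.DataFrame({'group': [0, 0, 0, 1, 1],
                    'val':   [0, 1, 2, 1, 4]})

# Group 0 contains a 0, so its 1 becomes 99; group 1 has no 0, so its 1 stays.
for _, s in toy.groupby('group')['val']:
    if (s == 0).any() and (s == 1).any():
        toy.loc[s.index[s == 1], 'val'] = 99

print(toy)
#    group  val
# 0      0    0
# 1      0   99
# 2      0    2
# 3      1    1
# 4      1    4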

But for DataFrames of this size it is quite slow:

In [87]: def f(s):
   ....:     # note: `0 in s` would test the Series *index*, not its values
   ....:     if (s == 0).any() and (s == 1).any(): s[s == 1] = 99
   ....:     return s
   ....: 

In [88]: %timeit df.groupby('group')['val'].transform(f)
1 loops, best of 3: 11.2 s per loop
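Most of that time is per-call Python overhead: transform invokes the function separately for each of the 10,000 groups. A quick, purely illustrative way to see this:

calls = {'n': 0}

def f_counted(s):
    calls['n'] += 1
    return s

df.groupby('group')['val'].transform(f_counted)
print(calls['n'])  # ~10000: roughly one Python-level call per group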

Looping through the data frame is much uglier but much faster:

In [89]: %paste

def g(df):
    df.sort_values(['group', 'val'], inplace=True)
    last_g = -1
    for i in range(len(df)):
        if df.group.iloc[i] != last_g:
            has_zero = False
            last_g = df.group.iloc[i]   # remember which group we are scanning
        if df.val.iloc[i] == 0:
            has_zero = True
        elif has_zero and df.val.iloc[i] == 1:
            # positional write; chained df.val.iloc[i] = 99 may not stick
            df.iloc[i, df.columns.get_loc('val')] = 99
    return df
## -- End pasted text --

In [90]: %timeit g(df)
1 loops, best of 3: 2.53 s per loop
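For what it's worth, a large share of the remaining cost is the per-row .iloc indexing itself. A sketch of the same scan over plain NumPy arrays (same logic, illustrative names, not benchmarked here) is typically much faster:

import numpy as np

def g_arrays(df):
    df = df.sort_values(['group', 'val']).reset_index(drop=True)
    groups = df['group'].to_numpy()
    vals = df['val'].to_numpy().copy()
    last_g, has_zero = -1, False
    for i in range(len(vals)):
        if groups[i] != last_g:
            has_zero = False
            last_g = groups[i]
        if vals[i] == 0:
            has_zero = True
        elif has_zero and vals[i] == 1:
            vals[i] = 99
    df['val'] = vals
    return df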

But I would like to optimize it further if possible.

Any idea of how to do so?

Thanks


Based on Jeff's answer, I arrived at a solution that is very fast. I'm putting it here in case others find it useful:

In [122]: def do_fast(df):
   .....:     has_zero_mask = df.group.isin(df.loc[df.val == 0, 'group'].unique())
   .....:     df.loc[(df.val == 1) & has_zero_mask, 'val'] = 99
   .....:     return df
   .....: 

In [123]: %timeit do_fast(df)
100 loops, best of 3: 11.2 ms per loop
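An equivalent formulation with groupby/transform, in case it reads more naturally (a sketch; I have not benchmarked it against do_fast):

# per-row flag: does this row's group contain any val == 0?
has_zero_mask = (df.val == 0).groupby(df.group).transform('any')
df.loc[(df.val == 1) & has_zero_mask, 'val'] = 99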

Upvotes: 0

Views: 857

Answers (1)

Jeff

Reputation: 128948

Not 100% sure this is what you are going for, but it should be simple to swap in different filtering/setting criteria:

In [37]: pd.set_option('max_rows',10)

In [38]: np.random.seed(1234)

In [39]: def f():

           # create the frame (n = 10000, as in the question)
           df=pd.DataFrame({'group':sorted(list(range(n))*4),
                            'val':np.random.randint(6,size=4*n)}).sort_values(['group','val']).reset_index(drop=True)

           df['result'] = np.nan

           # create a running count within each group
           df['counter'] = df.groupby('group').cumcount()

           # integer group codes for the rows where val==1 (internal API)
           mask = df[df.val==1].groupby('group').grouper.group_info[0]

           # set em (parentheses needed: & binds tighter than ==)
           df.loc[df.index.isin(mask) & (df['counter'] == 1),'result'] = 99

           return df

In [40]: %timeit f()
10 loops, best of 3: 95 ms per loop

In [41]: f()
Out[41]: 
       group  val  result  counter
0          0    3     NaN        0
1          0    4      99        1
2          0    4     NaN        2
3          0    5     NaN        3
4          1    0     NaN        0
...      ...  ...     ...      ...
39995   9998    4     NaN        3
39996   9999    0     NaN        0
39997   9999    0     NaN        1
39998   9999    2     NaN        2
39999   9999    3     NaN        3

[40000 rows x 4 columns]
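As an aside, .grouper.group_info is internal pandas API; in later versions, .ngroup() exposes the same integer group codes publicly. A sketch of the substitution, assuming the df built in f() above:

# per-row integer group code for the rows where val == 1,
# matching what grouper.group_info[0] exposes internally
mask = df[df.val == 1].groupby('group').ngroup().to_numpy()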

Upvotes: 1
