The Great

Reputation: 7703

Efficient way to use pandas group by on a million records

I have a dataframe which can be generated using the code below

    import numpy as np
    import pandas as pd

    df2 = pd.DataFrame({'subject_ID': [1,1,1,1,1,1,2,2,2,2],
                        'colum': ['L1CreaDate','L1Crea','L2CreaDate','L2Crea','L3CreaDate','L3Crea',
                                  'L1CreaDate','L1Crea','L2CreaDate','L2Crea'],
                        'dates': ['2016-10-30 00:00:00',2.3,'2016-10-30 00:00:00',2.5,np.nan,np.nan,
                                  '2016-10-30 00:00:00',12.3,'2016-10-30 00:00:00',12.3]})

I am trying to do the operations below on that dataframe. The code works fine, but the issue is the groupby statement: it is quick on the sample dataframe, but on the real data, with over a million records, it runs for a long time.

    df2['col2'] = df2['colum'].str.split("Date").str[0]
    df2['col3'] = df2['col2'].str.extract(r'(\d+)', expand=True).astype(int)
    df2 = df2.sort_values(by=['subject_ID','col3'])
    df2['count'] = df2.groupby(['subject_ID','col2'])['dates'].transform(pd.Series.count)

I do the groupby to get the count column shown in the output below, so that I can reject records with a count of 0. There is a specific logic behind dropping NAs; it is not simply dropping all of them. If you would like to know more about that, refer to this post: retain few NA's and drop rest of the NA's logic.
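
For reference, the rejection step after the groupby is roughly the line below (a sketch of what I currently do with the count column):

    df2 = df2[df2['count'] != 0]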

In the real data one person can have more than 10,000 rows, so a single dataframe has over a million rows.

Is there a better, more efficient way to do the groupby or to get the count column?

[image: expected output with the count column]

Upvotes: 1

Views: 1674

Answers (1)

jezrael

Reputation: 862611

The idea is to use a list comprehension with split to improve performance, then, instead of assigning the output to a new count column, filter directly, and finally sort by the extracted integers:

df2['col2'] = [x.split("Date")[0] for x in df2['colum']]
df2 = df2[df2.groupby(['subject_ID','col2'])['dates'].transform('count').ne(0)].copy()

df2['col3'] = df2['col2'].str.extract(r'(\d+)', expand=True).astype(int)
df2 = df2.sort_values(by=['subject_ID','col3'])
print (df2)
   subject_ID       colum                dates    col2  col3
0           1  L1CreaDate  2016-10-30 00:00:00  L1Crea     1
1           1      L1Crea                  2.3  L1Crea     1
2           1  L2CreaDate  2016-10-30 00:00:00  L2Crea     2
3           1      L2Crea                  2.5  L2Crea     2
6           2  L1CreaDate  2016-10-30 00:00:00  L1Crea     1
7           2      L1Crea                 12.3  L1Crea     1
8           2  L2CreaDate  2016-10-30 00:00:00  L2Crea     2
9           2      L2Crea                 12.3  L2Crea     2

If you get the error:

AttributeError: 'float' object has no attribute 'split'

it means there are missing values (x == x evaluates to False only for NaN), so the list comprehension should be changed to:

df2['col2'] = [x.split("Date")[0] if x == x else np.nan for x in df2['colum']]
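
Equivalently, the NaN check can be written with pd.isna, which reads a bit more explicitly (just a stylistic variant of the same comprehension):

df2['col2'] = [np.nan if pd.isna(x) else x.split("Date")[0] for x in df2['colum']]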

To check performance:

def new(df2):
    df2['col2'] = [x.split("Date")[0] for x in df2['colum']]
    df2 = df2[df2.groupby(['subject_ID','col2'])['dates'].transform('count').ne(0)].copy()
    df2['col3'] = df2['col2'].str.extract(r'(\d+)', expand=True).astype(int)
    return df2.sort_values(by=['subject_ID','col3'])


def orig(df2):
    df2['col2'] = df2['colum'].str.split("Date").str[0]
    df2['col3'] = df2['col2'].str.extract(r'(\d+)', expand=True).astype(int)
    df2 = df2.sort_values(by=['subject_ID','col3'])
    df2['count'] = df2.groupby(['subject_ID','col2'])['dates'].transform(pd.Series.count)
    return df2[df2['count'].ne(0)]

In [195]: %timeit (orig(df2))
10.8 ms ± 728 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [196]: %timeit (new(df2))
6.11 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
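
The sample frame has only 10 rows; to get timings closer to the real data, one option is to replicate it before benchmarking (a rough sketch, assuming the repeated subject_IDs are acceptable for the test):

df_big = pd.concat([df2] * 100000, ignore_index=True)

%timeit new(df_big.copy())
%timeit orig(df_big.copy())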

Upvotes: 1
