Reputation: 7703
I have a dataframe which can be generated using the code below:

import numpy as np
import pandas as pd

df2 = pd.DataFrame({'subject_ID': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
                    'colum': ['L1CreaDate', 'L1Crea', 'L2CreaDate', 'L2Crea', 'L3CreaDate',
                              'L3Crea', 'L1CreaDate', 'L1Crea', 'L2CreaDate', 'L2Crea'],
                    'dates': ['2016-10-30 00:00:00', 2.3, '2016-10-30 00:00:00', 2.5, np.nan,
                              np.nan, '2016-10-30 00:00:00', 12.3, '2016-10-30 00:00:00', 12.3]})
I am trying to do the operations below on the above dataframe. Though the code works fine, the issue is with the groupby statement: it's quick on the sample dataframe, but on the real data, with over 1 million records, it runs for a long time.
df2['col2'] = df2['colum'].str.split("Date").str[0]
df2['col3'] = df2['col2'].str.extract(r'(\d+)', expand=True).astype(int)
df2 = df2.sort_values(by=['subject_ID','col3'])
df2['count'] = df2.groupby(['subject_ID','col2'])['dates'].transform(pd.Series.count)
I do the groupby to get the count column shown in the output below, so that I can reject records with a count of 0. There is logic involved behind dropping the NA's; it's not just about dropping all NA's. If you would like to know about that, refer to this post: retain few NA's and drop rest of the NA's logic.
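For context, the rejection step itself is just a boolean filter on the computed count column (a minimal sketch, reusing the count column built above):

# keep only rows whose (subject_ID, col2) group has at least one non-null date
df2 = df2[df2['count'].ne(0)]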
In the real data one person might have more than 10,000 rows, so a single dataframe has more than 1 million rows.
Is there any better, more efficient way to do the groupby or to get the count column?
Upvotes: 1
Views: 1674
Reputation: 862611
The idea is to use a list comprehension with split to improve performance, then not assign the output to a new count column but filter directly, and finally sort by the extracted integers:
df2['col2'] = [x.split("Date")[0] for x in df2['colum']]
df2 = df2[df2.groupby(['subject_ID','col2'])['dates'].transform('count').ne(0)].copy()
df2['col3'] = df2['col2'].str.extract(r'(\d+)', expand=True).astype(int)
df2 = df2.sort_values(by=['subject_ID','col3'])
print(df2)
subject_ID colum dates col2 col3
0 1 L1CreaDate 2016-10-30 00:00:00 L1Crea 1
1 1 L1Crea 2.3 L1Crea 1
2 1 L2CreaDate 2016-10-30 00:00:00 L2Crea 2
3 1 L2Crea 2.5 L2Crea 2
6 2 L1CreaDate 2016-10-30 00:00:00 L1Crea 1
7 2 L1Crea 12.3 L1Crea 1
8 2 L2CreaDate 2016-10-30 00:00:00 L2Crea 2
9 2 L2Crea 12.3 L2Crea 2
If you get the error:

AttributeError: 'float' object has no attribute 'split'

it means there are possibly missing values, so the list comprehension should be changed (x == x is False only for NaN, so missing values are passed through unchanged):
df2['col2'] = [x.split("Date")[0] if x == x else np.nan for x in df2['colum']]
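Equivalently, the NaN check can be written with pd.isna, which may read more explicitly (a small variation on the same list comprehension, not a different technique):

df2['col2'] = [np.nan if pd.isna(x) else x.split("Date")[0] for x in df2['colum']]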
To check performance:
def new(df2):
    df2['col2'] = [x.split("Date")[0] for x in df2['colum']]
    df2 = df2[df2.groupby(['subject_ID','col2'])['dates'].transform('count').ne(0)].copy()
    df2['col3'] = df2['col2'].str.extract(r'(\d+)', expand=True).astype(int)
    return df2.sort_values(by=['subject_ID','col3'])

def orig(df2):
    df2['col2'] = df2['colum'].str.split("Date").str[0]
    df2['col3'] = df2['col2'].str.extract(r'(\d+)', expand=True).astype(int)
    df2 = df2.sort_values(by=['subject_ID','col3'])
    df2['count'] = df2.groupby(['subject_ID','col2'])['dates'].transform(pd.Series.count)
    return df2[df2['count'].ne(0)]
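The timings below are on the small 10-row sample; to approximate the 1M-row case from the question, the sample frame can be tiled before benchmarking (a rough sketch, since the real data's shape is unknown and the repetition factor of 100_000 is arbitrary):

# build a ~1M-row frame by repeating the 10-row sample; .copy() protects the
# original, since both functions mutate their input in place
df_big = pd.concat([df2] * 100_000, ignore_index=True)
%timeit new(df_big.copy())
%timeit orig(df_big.copy())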
In [195]: %timeit (orig(df2))
10.8 ms ± 728 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [196]: %timeit (new(df2))
6.11 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 1