I have a DataFrame with MultiIndex which is basically a binary matrix:
day        day01                        day02
session session1 session2 session3   session1 session2 session3
0              1        0        0          0        0        0
1              0        0        1          1        1        0
2              1        1        1          0        0        1
3              1        0        0          1        0        0
4              1        0        1          0        0        0
From this DataFrame, I need to calculate daily sums for each row:
   day01  day02
0      1      0
1      1      2
2      3      1
3      1      1
4      2      0
And get the number of 0s, 1s... (value counts) in this sum:
0    2
1    5
2    2
3    1
I need to do this for sessions, too. Session sums for each row:
   session1  session2  session3
0         1         0         0
1         1         1         1
2         1         1         2
3         2         0         0
4         1         0         1
And get the value counts:
0    5
1    8
2    2
As a baseline, these are the results of df.groupby(level='day', axis=1).sum().stack().value_counts() and df.groupby(level='session', axis=1).sum().stack().value_counts(). The DataFrame changes in each iteration of a simulated annealing algorithm, and these counts are recalculated every time. When I profiled the code, I saw that a significant amount of time was spent on the groupby operations.
I tried saving the groupby objects and taking sums on those objects in each iteration, but the improvement was only about 10%. Here's the code to create a larger DataFrame (similar to the one I have):
import numpy as np
import pandas as pd
prng = np.random.RandomState(0)
days = ['day{0:02d}'.format(i) for i in range(1, 11)]
sessions = ['session{}'.format(i) for i in range(1, 5)]
idx = pd.MultiIndex.from_product((days, sessions), names=['day', 'session'])
df = pd.DataFrame(prng.binomial(1, 0.25, (1250, 40)), columns=idx)
On my computer, the following two methods take 3.8 s and 3.38 s, respectively.
def try1(df, num_repeats=1000):
    for i in range(num_repeats):
        session_counts = (df.groupby(level='session', axis=1, sort=False)
                            .sum()
                            .stack()
                            .value_counts(sort=False))
        daily_counts = (df.groupby(level='day', axis=1, sort=False)
                          .sum()
                          .stack()
                          .value_counts(sort=False))
    return session_counts, daily_counts
def try2(df, num_repeats=1000):
    session_groups = df.groupby(level='session', axis=1, sort=False)
    day_groups = df.groupby(level='day', axis=1, sort=False)
    for i in range(num_repeats):
        df.iat[0, 0] = (i + 1) % 2
        session_counts = session_groups.sum().stack().value_counts(sort=False)
        daily_counts = day_groups.sum().stack().value_counts(sort=False)
    return session_counts, daily_counts
%time try1(df)
Wall time: 3.8 s
%time try2(df)
Wall time: 3.38 s
Note: The loops in the functions are for timing only. For the second function, I needed to modify the DataFrame in each iteration in order to get correct timings.
I am currently working on another method to reflect changes in the DataFrame directly in the counts, without recalculating the groups, but I haven't succeeded yet. Tracking the affected rows and updating saved DataFrames turned out to be slower.
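To make the idea concrete, this is roughly the kind of incremental bookkeeping I have in mind (just a sketch, not working code from my project; flip_cell and the Counter-based tallies are illustrative and assume exactly one cell is flipped per annealing step):
from collections import Counter

# Precomputed state; arr has shape (rows, n_days, n_sessions) and assumes the
# day-major column order produced by pd.MultiIndex.from_product above.
arr = df.values.reshape(len(df), len(days), len(sessions)).copy()
daily_sums = arr.sum(axis=2)        # (rows, n_days)
session_sums = arr.sum(axis=1)      # (rows, n_sessions)
daily_counts = Counter(daily_sums.ravel().tolist())
session_counts = Counter(session_sums.ravel().tolist())

def flip_cell(row, day, session):
    # Flip one 0/1 cell and move the two affected row sums to their new buckets.
    delta = 1 - 2 * arr[row, day, session]      # +1 if the cell is 0, -1 if it is 1
    arr[row, day, session] += delta

    old = daily_sums[row, day]
    daily_counts[old] -= 1
    daily_sums[row, day] = old + delta
    daily_counts[old + delta] += 1

    old = session_sums[row, session]
    session_counts[old] -= 1
    session_sums[row, session] = old + delta
    session_counts[old + delta] += 1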
Is there a way to improve the performance of these groupby operations?
Upvotes: 4
Views: 375
Reputation: 221574
Assuming a regular data format (an equal number of days and sessions in each row), here's a NumPy-based approach using np.unique, with the output indexes in sorted order -
# Extract array
a,b = df.columns.levels
arr = df.values.reshape(-1,len(a),len(b))
# Get session counts
session_sums = arr.sum(1)
unq,count = np.unique(session_sums,return_counts=True)
session_counts_out = pd.Series(count,index=unq)
# Get daily count
daily_sums = arr.sum(2)
unq,count = np.unique(daily_sums,return_counts=True)
daily_counts_out = pd.Series(count,index=unq)
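As a quick sanity check (assuming the df built in the question), the counts should line up with the sorted pandas baseline:
# Compare against the pandas groupby baseline; np.unique sorts, hence sort_index()
base_sessions = df.groupby(level='session', axis=1).sum().stack().value_counts().sort_index()
base_days = df.groupby(level='day', axis=1).sum().stack().value_counts().sort_index()
print((base_sessions.values == session_counts_out.values).all())   # expected: True
print((base_days.values == daily_counts_out.values).all())         # expected: True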
If you are only interested in the counts without the indexes, here's an alternative with np.bincount that essentially just does the counting performed by the return_counts part of np.unique -
# Get session counts
session_sums = arr.sum(1)
count = np.bincount(session_sums.ravel())
session_counts_out = count[count>0]
# Get daily count
daily_sums = arr.sum(2)
count = np.bincount(daily_sums.ravel())
daily_counts_out = count[count>0]
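If the bins need to stay aligned between iterations (say, to compare counts across annealing steps), np.bincount also accepts a minlength argument that gives a fixed-length count vector with the zeros kept in place - a small variation, assuming sums range from 0 up to the number of days/sessions -
# Fixed-length counts: position i holds how many row sums equal i (possibly 0)
n_days, n_sessions = len(a), len(b)
session_counts_full = np.bincount(arr.sum(1).ravel(), minlength=n_days + 1)
daily_counts_full = np.bincount(arr.sum(2).ravel(), minlength=n_sessions + 1)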
Upvotes: 1