Reputation: 8669
I have a dataframe
import pandas as pd
df=pd.DataFrame({'Players': [ 'Sam', 'Greg', 'Steve', 'Sam',
'Greg', 'Steve', 'Greg', 'Steve', 'Greg', 'Steve'],
'Wins': [10,5,5,20,30,20,6,9,3,10],
'Losses': [5,5,5,2,3,2,16,20,3,12],
'Type': ['A','B','B','B','A','B','B','A','A','B'],
})
If I wanted to summarize the I can manually make another dataframe:
p=df.groupby('Players')
summary = pd.DataFrame({'Total Games': p.Players.count(),
'Average Wins':p.Wins.mean(),
'Greatest Wins':p.Wins.max(),
'Unique games':p.Type.nunique()})
Lets say I want to automate this summarizing process to create a dataframe if column X is present perform summarization Y what would the best way of doing this? I have attempted using a dictionary but I think I did something wrong
p=df.groupby('Players')
sumdict = {'Total Games': ['Players', p.Players.count()],
'Average Wins':['Wins', p.Wins.mean()],
'Greatest Wins':['Wins', p.Wins.max()],
'Unique games':['Type', p.Type.nunique()],
'Max Score':['Score', p.Score.max()]}
summary=pd.DataFrame()
for key, value in sumdict.items():
if value[0] in df.columns:
data = pd.DataFrame({key: value[1],})
summary=summary.append(data)
else:
continue
Upvotes: 1
Views: 125
Reputation: 10318
Pandas DataFrame
s support most of the dict
methods, include get
(which allows you to substitute a value for an empty key). So you can do the statistics you want on all columns, then get the values for the column you want, substituting an empty Series
for missing columns, then drop NaN
columns (I use Bad Value
to demonstrate what happens to missing columns):
eser = pd.Series()
count = p.count().max(axis=1)
all_max = p.max()
score_max = all_max.get('Score', eser)
wins_max = all_max.get('Wins', eser)
wins_mean = p.mean().get('Wins', eser)
type_nunique = p.agg(lambda x: x.nunique()).get('Type', eser)
summary = pd.DataFrame({'Total Games': count,
'Average Wins': wins_mean,
'Greatest Wins': wins_max,
'Unique games': type_nunique,
'Max Score': score_max})
summary.dropna(axis=1, how='all', inplace=True)
Or a one-liner (which involves calculating the max
on all columns twice, which shouldn't be a problem for a small number of values):
summary = pd.DataFrame({'Total Games': p.count().max(axis=1),
'Average Wins': p.mean().get('Wins', pd.Series()),
'Greatest Wins': p.max().get('Wins', pd.Series()),
'Unique games': p.agg(lambda x: x.nunique()).get('Type', pd.Series()),
'Max Score': p.max().get('Score', pd.Series())}).dropna(axis=1, how='all')
The result of either approach:
Average Wins Greatest Wins Total Games Unique games
Greg 11 30 4 2
Sam 15 20 2 2
Steve 11 20 4 2
Without the dropna
:
Average Wins Greatest Wins Max Score Total Games Unique games
Greg 11 30 NaN 4 2
Sam 15 20 NaN 2 2
Steve 11 20 NaN 4 2
If performance is an issue, the above will be slow because they require computing multiple statistics on all columns, which means statistics are being computed and then thrown away. A faster, but uglier, approach is similar to your approach of using a loop over a dict
.
The problem with your implementation is that the dict
items are not lazily evaluated, they are evaluated when the dict
is created, which means it still tries to access the non-existent columns.
The approach below gets both the item and applies the function only when the column is found (with special handling for the count
case, since any existing column will work):
sumdict = {'Total Games': (None, 'count'),
'Average Wins': ('Wins', 'mean'),
'Greatest Wins': ('Wins', 'max'),
'Unique games': ('Type', 'nunique'),
'Max Score': ('Score', 'max')}
summary = []
for key, (column, op) in sumdict.items():
if column is None:
res = p.agg(op).max(axis=1)
elif column not in df:
continue
else:
res = p[column].agg(lambda x: getattr(x, op)())
summary.append(pd.DataFrame({key: res}))
summary = pd.concat(summary, axis=1)
It gives the same result as my above approaches, although with a different column order.
Upvotes: 1