Reputation: 95
I have the following dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
    'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
    'callbacks': [0, 0, 0, 0, 0, 1, 1],
    'applications': [0, 0, 1, 0, 0, 1, 1]})
df
user effortduration callbacks applications
user122 2 weeks 0 0
user122 NaN 0 0
user124 2 weeks 0 1
user125 3 weeks 0 0
user125 NaN 0 0
user126 2 weeks 1 1
user126 2 weeks 1 1
I would like to group by effortduration and get, for each column, a count based on the unique users: the number of unique users per group, and for callbacks and applications the number of unique users with a value of 1. This is what I have tried so far:
function = {"user": pd.Series.nunique,
            "callbacks": lambda x: x.nunique(),
            "applications": lambda x: x.isin(['1']).nunique()}

df.groupby('effortduration').agg(function)
user callbacks applications
effortduration
2 weeks 3 2 2
3 weeks 1 1 1
However, that is again not what I am looking for because the values of callbacks and applications are not based on the user column. My result should be something like this:
user callbacks applications
effortduration
2 weeks 3 1 2
3 weeks 1 0 0
Is there any way to do such a thing? If yes, is it also possible to generalize it because my original dataframe has many more columns and it would be painful to write all the functions by hand?
Upvotes: 1
Views: 84
Reputation: 62383
Replace 0 with NaN, then drop the rows where 'effortduration', 'callbacks', and 'applications' are all NaN. After dropping duplicate rows, a plain count per group gives the result you want.

import pandas as pd
import numpy as np

# sample data
df = pd.DataFrame({
    'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
    'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
    'callbacks': [0, 0, 0, 0, 0, 1, 1],
    'applications': [0, 0, 1, 0, 0, 1, 1]})
# replace 0 with NaN and drop rows where all three target columns are NaN
df = df.replace(0, np.nan).dropna(how='all', subset=['effortduration', 'callbacks', 'applications'])

# drop duplicate rows so a repeated user/value combination is only counted once
df = df.drop_duplicates()

# group by effortduration and count the non-null values in each column
dfg = df.groupby(['effortduration']).count()
# dfg
user callbacks applications
effortduration
2 weeks 3 1 2
3 weeks 1 0 0
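Because count() just counts the non-null values in every remaining column, this recipe also generalizes to a frame with many more columns without writing a function per column. A minimal sketch, assuming df still holds the original data and that every column besides 'user' and 'effortduration' is a 0/1 indicator like callbacks and applications (the programmatic column selection is my assumption, not part of the answer above):

import numpy as np

# treat every column except 'user' and 'effortduration' as a 0/1 indicator
indicator_cols = df.columns.difference(['user', 'effortduration']).tolist()

# same recipe as above, but with the subset built programmatically
dfg = (df.replace(0, np.nan)
         .dropna(how='all', subset=['effortduration'] + indicator_cols)
         .drop_duplicates()
         .groupby('effortduration')
         .count())
dfg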
For comparison, calling nunique directly just reproduces the result you already had, because it counts the distinct values (including 0) in each column per group rather than the users that meet each condition:

# recreate the sample data, since df was modified above
df = pd.DataFrame({
    'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
    'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
    'callbacks': [0, 0, 0, 0, 0, 1, 1],
    'applications': [0, 0, 1, 0, 0, 1, 1]})
# using nunique
dfg = df.groupby('effortduration').nunique()
# dfg
user callbacks applications
effortduration
2 weeks 3 2 2
3 weeks 1 1 1
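If you would rather leave the frame untouched and spell the logic out directly in terms of unique users per condition, a groupby/apply sketch along these lines should give the same table as the first approach; users_per_condition is a hypothetical helper and the hard-coded column list would need to match your real data:

cols = ['callbacks', 'applications']

def users_per_condition(g):
    # unique users in the group, plus unique users with a 1 in each indicator column
    counts = {'user': g['user'].nunique()}
    for col in cols:
        counts[col] = g.loc[g[col] == 1, 'user'].nunique()
    return pd.Series(counts)

dfg = df.groupby('effortduration')[['user'] + cols].apply(users_per_condition)
dfg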
Upvotes: 1