Reputation: 95
I have the following dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
    'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
    'callbacks': [0, 0, 0, 0, 0, 1, 1],
    'applications': [0, 0, 1, 0, 0, 1, 1]})
df
user effortduration callbacks applications
user122 2 weeks 0 0
user122 NaN 0 0
user124 2 weeks 0 1
user125 3 weeks 0 0
user125 NaN 0 0
user126 2 weeks 1 1
user126 2 weeks 1 1
I would like to group by effortduration and get, for each column, a count based on the unique users: the number of unique users per group, and for callbacks and applications the number of unique users with a value of 1. This is what I have tried so far:
function = {"user": pd.Series.nunique,
            "callbacks": lambda x: x.nunique(),
            "applications": lambda x: x.isin(['1']).nunique()}

df.groupby('effortduration').agg(function)
user callbacks applications
effortduration
2 weeks 3 2 2
3 weeks 1 1 1
However, that is again not what I am looking for because the values of callbacks and applications are not based on the user column. My result should be something like this:
user callbacks applications
effortduration
2 weeks 3 1 2
3 weeks 1 0 0
Is there any way to do such a thing? If yes, is it also possible to generalize it because my original dataframe has many more columns and it would be painful to write all the functions by hand?
Upvotes: 1
Views: 84
Reputation: 62383
Replace 0 with NaN, then drop the rows where 'effortduration', 'callbacks', and 'applications' are all NaN. After dropping duplicate rows, a plain count per group gives the result you want.

import pandas as pd
import numpy as np

# sample data
df = pd.DataFrame({
    'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
    'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
    'callbacks': [0, 0, 0, 0, 0, 1, 1],
    'applications': [0, 0, 1, 0, 0, 1, 1]})
# replace 0 with NaN and drop rows where all three target columns are NaN
df = df.replace(0, np.nan).dropna(how='all', subset=['effortduration', 'callbacks', 'applications'])

# drop duplicate rows so a repeated user/value combination is only counted once
df = df.drop_duplicates()

# group by effortduration and count the non-null values in each column
dfg = df.groupby(['effortduration']).count()
# dfg
user callbacks applications
effortduration
2 weeks 3 1 2
3 weeks 1 0 0
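Because count() just counts the non-null values in every remaining column, this recipe also generalizes to a frame with many more columns without writing a function per column. A minimal sketch, assuming df still holds the original data and that every column besides 'user' and 'effortduration' is a 0/1 indicator like callbacks and applications (the programmatic column selection is my assumption, not part of the answer above):

import numpy as np

# treat every column except 'user' and 'effortduration' as a 0/1 indicator
indicator_cols = df.columns.difference(['user', 'effortduration']).tolist()

# same recipe as above, but with the subset built programmatically
dfg = (df.replace(0, np.nan)
         .dropna(how='all', subset=['effortduration'] + indicator_cols)
         .drop_duplicates()
         .groupby('effortduration')
         .count())
dfg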
For comparison, calling nunique directly just reproduces the result you already had, because it counts the distinct values (including 0) in each column per group rather than the users that meet each condition:

# recreate the sample data, since df was modified above
df = pd.DataFrame({
    'user': ['user122', 'user122', 'user124', 'user125', 'user125', 'user126', 'user126'],
    'effortduration': ['2 weeks', np.nan, '2 weeks', '3 weeks', np.nan, '2 weeks', '2 weeks'],
    'callbacks': [0, 0, 0, 0, 0, 1, 1],
    'applications': [0, 0, 1, 0, 0, 1, 1]})
# using nunique
dfg = df.groupby('effortduration').nunique()
# dfg
user callbacks applications
effortduration
2 weeks 3 2 2
3 weeks 1 1 1
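If you would rather leave the frame untouched and spell the logic out directly in terms of unique users per condition, a groupby/apply sketch along these lines should give the same table as the first approach; users_per_condition is a hypothetical helper and the hard-coded column list would need to match your real data:

cols = ['callbacks', 'applications']

def users_per_condition(g):
    # unique users in the group, plus unique users with a 1 in each indicator column
    counts = {'user': g['user'].nunique()}
    for col in cols:
        counts[col] = g.loc[g[col] == 1, 'user'].nunique()
    return pd.Series(counts)

dfg = df.groupby('effortduration')[['user'] + cols].apply(users_per_condition)
dfg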
Upvotes: 1