Slavka

Reputation: 1148

Calculate nunique() for groupby in pandas

I have a dataframe with columns:

  1. diff - difference between registration date and payment date, in days
  2. country - country of the user
  3. user_id
  4. campaign_id - another categorical column, which we will use in the groupby

I need to calculate the count of distinct users for every country + campaign_id group with diff <= n. For example, for country 'A', campaign 'abc' and diff 7, I need the count of distinct users from country 'A' and campaign 'abc' with diff <= 7.
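
To make the expected output concrete, here is a tiny hand-checked illustration (the toy rows below are made up for this example, not part of the real data):

import pandas as pd

toy = pd.DataFrame({
    'country':  ['A', 'A', 'A'],
    'campaign': ['abc', 'abc', 'abc'],
    'user_id':  [1, 2, 1],
    'diff':     [3, 5, 9],
})
# For the group ('A', 'abc') and cutoff diff <= 5 the distinct users are {1, 2},
# so the expected count is 2; for diff <= 3 it is 1 (only user 1).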

My current solution (below) takes too long to run:

import pandas as pd
import numpy as np

## generate test dataframe
df = pd.DataFrame({
        'country':np.random.choice(['A', 'B', 'C', 'D'], 10000),
        'campaign': np.random.choice(['camp1', 'camp2', 'camp3', 'camp4', 'camp5', 'camp6'], 10000),
        'diff':np.random.choice(range(10), 10000),
        'user_id': np.random.choice(range(1000), 10000)
        })
## main
result_df = pd.DataFrame()
for diff in df['diff'].unique():
    tmp_df = df.loc[df['diff']<=diff,:]
    tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).apply(lambda x: x.user_id.nunique()).reset_index()
    tmp_df['diff'] = diff
    tmp_df.columns=['country', 'campaign', 'unique_ppl', 'diff']
    result_df = pd.concat([result_df, tmp_df],ignore_index=True, axis=0)

Maybe there is a better way to do this?

Upvotes: 8

Views: 22507

Answers (2)

jpp

Reputation: 164713

One alternative is below, but @jezrael's solution is optimal.

Performance benchmarking

%timeit original(df)  # 149ms
%timeit jp(df)        # 81ms
%timeit jez(df)       # 47ms

def original(df):
    result_df = pd.DataFrame()
    for diff in df['diff'].unique():
        tmp_df = df.loc[df['diff']<=diff,:]
        tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).apply(lambda x: x.user_id.nunique()).reset_index()
        tmp_df['diff'] = diff
        tmp_df.columns=['country', 'campaign', 'unique_ppl', 'diff']
        result_df = pd.concat([result_df, tmp_df],ignore_index=True, axis=0)

    return result_df

def jp(df):
    lst = []
    lst_append = lst.append
    for diff in df['diff'].unique():
        tmp_df = df.loc[df['diff']<=diff,:]
        tmp_df = tmp_df.groupby(['country', 'campaign'], as_index=False).agg({'user_id': 'nunique'})
        tmp_df['diff'] = diff
        tmp_df.columns=['country', 'campaign', 'unique_ppl', 'diff']
        lst_append(tmp_df)

    # combine the per-cutoff frames in a single concat
    return pd.concat(lst, ignore_index=True, axis=0)

def jez(df):
    df1 = pd.concat([df.loc[df['diff']<=x].assign(diff=x) for x in  df['diff'].unique()])
    df2 = (df1.groupby(['diff','country', 'campaign'], sort=False)['user_id']
              .nunique()
              .reset_index()
              .rename(columns={'user_id':'unique_ppl'})
              .reindex(columns=['country', 'campaign', 'unique_ppl', 'diff']))
    return df2
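
As a sanity check (not part of the benchmark; the same_result helper below is just a sketch), the three functions can be compared after sorting on the key columns, since they return rows in different orders:

def same_result(a, b):
    # Sort both results on the group keys so differing row order doesn't matter,
    # then compare values column by column.
    keys = ['country', 'campaign', 'diff']
    a = a.sort_values(keys).reset_index(drop=True)[keys + ['unique_ppl']]
    b = b.sort_values(keys).reset_index(drop=True)[keys + ['unique_ppl']]
    return a.equals(b)

print(same_result(original(df), jez(df)))  # expected True
print(same_result(jp(df), jez(df)))        # expected True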

Upvotes: 4

jezrael

Reputation: 862911

First use a list comprehension with concat and assign to join everything together while adding the diff column, then groupby with nunique; last, rename the columns and, if necessary, add reindex for a custom column order:

df1 = pd.concat([df.loc[df['diff']<=x].assign(diff=x) for x in  df['diff'].unique()])
df2 = (df1.groupby(['diff','country', 'campaign'], sort=False)['user_id']
          .nunique()
          .reset_index()
          .rename(columns={'user_id':'unique_ppl'})
          .reindex(columns=['country', 'campaign', 'unique_ppl', 'diff']))
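
The count for one group and cutoff can then be read straight from df2; for example (using country 'A' and campaign 'camp1' from the generated test data, with diff <= 7):

# distinct users from country 'A', campaign 'camp1' with diff <= 7
mask = (df2['country'] == 'A') & (df2['campaign'] == 'camp1') & (df2['diff'] == 7)
print(df2.loc[mask, 'unique_ppl'])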

Upvotes: 8
