Adriano Almeida
Adriano Almeida

Reputation: 5346

Pandas 'count(distinct)' equivalent

I am using Pandas as a database substitute as I have multiple databases (Oracle, SQL Server, etc.), and I am unable to make a sequence of commands to a SQL equivalent.

I have a table loaded in a DataFrame with some columns:

YEARMONTH, CLIENTCODE, SIZE, etc., etc.

In SQL, to count the amount of different clients per year would be:

SELECT count(distinct CLIENTCODE) FROM table GROUP BY YEARMONTH;

And the result would be

201301    5000
201302    13245

How can I do that in Pandas?

Upvotes: 417

Views: 762709

Answers (12)

shadowtalker
shadowtalker

Reputation: 13853

To count distinct values of more than one column, use len() on a DataFrameGroupBy.

For example, to count distinct values of (a, b):

len(df.groupby(["a", "b"]))

Equivalent SQL:

SELECT count(distinct (a, b))
FROM df

This can be done within groups by using the .apply groupby method.

For example, to count distinct values of (a, b) within group g:

data.groupby("g").apply(lambda df: len(df.groupby(["a", "b"]))

Equivalent SQL:

SELECT count(distinct (a, b))
FROM df
GROUP BY g

.apply is needed because .agg and .transform only allow you to apply an aggregation or transformation columnwise.

Upvotes: 0

Panwen Wang
Panwen Wang

Reputation: 3825

Now you are also able to use dplyr syntax in Python to do it:

>>> from datar.all import f, tibble, group_by, summarise, n_distinct
>>>
>>> data = tibble(
...     CLIENT_CODE=[1,1,2,1,2,2,3],
...     YEAR_MONTH=[201301,201301,201301,201302,201302,201302,201302]
... )
>>>
>>> data >> group_by(f.YEAR_MONTH) >> summarise(n=n_distinct(f.CLIENT_CODE))
   YEAR_MONTH       n
      <int64> <int64>
0      201301       2
1      201302       3

Upvotes: 1

cottontail
cottontail

Reputation: 23131

  • To obtain the number of different clients and sizes per year (i.e. number of unique values of multiple columns), use a list of columns:
    df.groupby('YEARMONTH')[['CLIENTCODE', 'SIZE']].nunique()
    
  • Actually, the result from the above code can be obtained using SQL syntax on df using pandasql (a module built on pandas that lets you query pandas DataFrames using SQL syntax).
    #! pip install pandasql
    from pandasql import sqldf
    sqldf("""
    SELECT COUNT(DISTINCT CLIENTCODE), 
           COUNT(DISTINCT SIZE)
      FROM df 
      GROUP BY YEARMONTH
    """)
    
  • If you want to keep YEARMONTH as a column, i.e. the analog of the following SQL query
    SELECT YEARMONTH, 
           COUNT(DISTINCT CLIENTCODE),
           COUNT(DISTINCT SIZE)
      FROM df 
      GROUP BY YEARMONTH
    
    in pandas is the following (set as_index to False):
    df.groupby('YEARMONTH', as_index=False)[['CLIENTCODE', 'SIZE']].nunique()
    
  • If you need to set custom names to the aggregated columns, i.e. the analog of the following SQL query:
    SELECT YEARMONTH, 
           COUNT(DISTINCT CLIENTCODE) AS `No. clients`, 
           COUNT(DISTINCT SIZE) AS `No. size`
      FROM df 
      GROUP BY YEARMONTH
    
    use named aggregation in pandas:
    (
        df.groupby('YEARMONTH', as_index=False)
          .agg(**{'No. clients':('CLIENTCODE', 'nunique'), 
                  'No. size':('SIZE', 'nunique')})
    )
    

Upvotes: 0

Create a pivot table and use the nunique series function:

ID = [ 123, 123, 123, 456, 456, 456, 456, 789, 789]
domain = ['vk.com', 'vk.com', 'twitter.com', 'vk.com', 'facebook.com',
          'vk.com', 'google.com', 'twitter.com', 'vk.com']
df = pd.DataFrame({'id':ID, 'domain':domain})
fp = pd.pivot_table(data=df, index='domain', aggfunc=pd.Series.nunique)
print(fp)

Output:

               id
domain
facebook.com   1
google.com     1
twitter.com    2
vk.com         3

Upvotes: 1

Ramon
Ramon

Reputation: 538

Here is an approach to have count distinct over multiple columns. Let's have some data:

data = {'CLIENT_CODE':[1,1,2,1,2,2,3],
        'YEAR_MONTH':[201301,201301,201301,201302,201302,201302,201302],
        'PRODUCT_CODE': [100,150,220,400,50,80,100]
       }
table = pd.DataFrame(data)
table

CLIENT_CODE YEAR_MONTH  PRODUCT_CODE
0   1       201301      100
1   1       201301      150
2   2       201301      220
3   1       201302      400
4   2       201302      50
5   2       201302      80
6   3       201302      100

Now, list the columns of interest and use groupby in a slightly modified syntax:

columns = ['YEAR_MONTH', 'PRODUCT_CODE']
table[columns].groupby(table['CLIENT_CODE']).nunique()

We obtain:

YEAR_MONTH  PRODUCT_CODE CLIENT_CODE
1           2            3
2           2            3
3           1            1

Upvotes: 4

Wickkiey
Wickkiey

Reputation: 4632

With the new Pandas version, it is easy to get as a data frame:

unique_count = pd.groupby(['YEARMONTH'], as_index=False).agg(uniq_CLIENTCODE=('CLIENTCODE', pd.Series.count))

Upvotes: 1

BENY
BENY

Reputation: 323226

Using crosstab, this will return more information than groupby nunique:

pd.crosstab(df.YEARMONTH,df.CLIENTCODE)
Out[196]:
CLIENTCODE  1  2  3
YEARMONTH
201301      2  1  0
201302      1  2  1

After a little bit of modification, it yields the result:

pd.crosstab(df.YEARMONTH,df.CLIENTCODE).ne(0).sum(1)
Out[197]:
YEARMONTH
201301    2
201302    3
dtype: int64

Upvotes: 5

StatguyUser
StatguyUser

Reputation: 2665

Here is another method and it is much simpler. Let’s say your dataframe name is daat and the column name is YEARMONTH:

daat.YEARMONTH.value_counts()

Upvotes: 134

Vivek Payasi
Vivek Payasi

Reputation: 619

Distinct of column along with aggregations on other columns

To get the distinct number of values for any column (CLIENTCODE in your case), we can use nunique. We can pass the input as a dictionary in agg function, along with aggregations on other columns:

grp_df = df.groupby('YEARMONTH').agg({'CLIENTCODE': ['nunique'],
                                      'other_col_1': ['sum', 'count']})

# to flatten the multi-level columns
grp_df.columns = ["_".join(col).strip() for col in grp_df.columns.values]

# if you wish to reset the index
grp_df.reset_index(inplace=True)

Upvotes: 18

Gangaraju
Gangaraju

Reputation: 4562

I am also using nunique but it will be very helpful if you have to use an aggregate function like 'min', 'max', 'count' or 'mean' etc.

df.groupby('YEARMONTH')['CLIENTCODE'].transform('nunique') #count(distinct)
df.groupby('YEARMONTH')['CLIENTCODE'].transform('min')     #min
df.groupby('YEARMONTH')['CLIENTCODE'].transform('max')     #max
df.groupby('YEARMONTH')['CLIENTCODE'].transform('mean')    #average
df.groupby('YEARMONTH')['CLIENTCODE'].transform('count')   #count

Upvotes: 20

Roman Kh
Roman Kh

Reputation: 2735

Interestingly enough, very often len(unique()) is a few times (3x-15x) faster than nunique().

Upvotes: 56

Dan Allan
Dan Allan

Reputation: 35235

I believe this is what you want:

table.groupby('YEARMONTH').CLIENTCODE.nunique()

Example:

In [2]: table
Out[2]: 
   CLIENTCODE  YEARMONTH
0           1     201301
1           1     201301
2           2     201301
3           1     201302
4           2     201302
5           2     201302
6           3     201302

In [3]: table.groupby('YEARMONTH').CLIENTCODE.nunique()
Out[3]: 
YEARMONTH
201301       2
201302       3

Upvotes: 598

Related Questions