Reputation: 5346
I am using Pandas as a database substitute as I have multiple databases (Oracle, SQL Server, etc.), and I am unable to make a sequence of commands to a SQL equivalent.
I have a table loaded in a DataFrame with some columns:
YEARMONTH, CLIENTCODE, SIZE, etc., etc.
In SQL, to count the amount of different clients per year would be:
SELECT count(distinct CLIENTCODE) FROM table GROUP BY YEARMONTH;
And the result would be
201301 5000
201302 13245
How can I do that in Pandas?
Upvotes: 417
Views: 762709
Reputation: 13853
To count distinct values of more than one column, use len()
on a DataFrameGroupBy
.
For example, to count distinct values of (a, b)
:
len(df.groupby(["a", "b"]))
Equivalent SQL:
SELECT count(distinct (a, b))
FROM df
This can be done within groups by using the .apply
groupby method.
For example, to count distinct values of (a, b)
within group g
:
data.groupby("g").apply(lambda df: len(df.groupby(["a", "b"]))
Equivalent SQL:
SELECT count(distinct (a, b))
FROM df
GROUP BY g
.apply
is needed because .agg
and .transform
only allow you to apply an aggregation or transformation columnwise.
Upvotes: 0
Reputation: 3825
Now you are also able to use dplyr syntax in Python to do it:
>>> from datar.all import f, tibble, group_by, summarise, n_distinct
>>>
>>> data = tibble(
... CLIENT_CODE=[1,1,2,1,2,2,3],
... YEAR_MONTH=[201301,201301,201301,201302,201302,201302,201302]
... )
>>>
>>> data >> group_by(f.YEAR_MONTH) >> summarise(n=n_distinct(f.CLIENT_CODE))
YEAR_MONTH n
<int64> <int64>
0 201301 2
1 201302 3
Upvotes: 1
Reputation: 23131
df.groupby('YEARMONTH')[['CLIENTCODE', 'SIZE']].nunique()
df
using pandasql
(a module built on pandas that lets you query pandas DataFrames using SQL syntax).
#! pip install pandasql
from pandasql import sqldf
sqldf("""
SELECT COUNT(DISTINCT CLIENTCODE),
COUNT(DISTINCT SIZE)
FROM df
GROUP BY YEARMONTH
""")
SELECT YEARMONTH,
COUNT(DISTINCT CLIENTCODE),
COUNT(DISTINCT SIZE)
FROM df
GROUP BY YEARMONTH
in pandas is the following (set as_index
to False
):
df.groupby('YEARMONTH', as_index=False)[['CLIENTCODE', 'SIZE']].nunique()
SELECT YEARMONTH,
COUNT(DISTINCT CLIENTCODE) AS `No. clients`,
COUNT(DISTINCT SIZE) AS `No. size`
FROM df
GROUP BY YEARMONTH
use named aggregation in pandas:
(
df.groupby('YEARMONTH', as_index=False)
.agg(**{'No. clients':('CLIENTCODE', 'nunique'),
'No. size':('SIZE', 'nunique')})
)
Upvotes: 0
Reputation: 4243
Create a pivot table and use the nunique series function:
ID = [ 123, 123, 123, 456, 456, 456, 456, 789, 789]
domain = ['vk.com', 'vk.com', 'twitter.com', 'vk.com', 'facebook.com',
'vk.com', 'google.com', 'twitter.com', 'vk.com']
df = pd.DataFrame({'id':ID, 'domain':domain})
fp = pd.pivot_table(data=df, index='domain', aggfunc=pd.Series.nunique)
print(fp)
Output:
id
domain
facebook.com 1
google.com 1
twitter.com 2
vk.com 3
Upvotes: 1
Reputation: 538
Here is an approach to have count distinct over multiple columns. Let's have some data:
data = {'CLIENT_CODE':[1,1,2,1,2,2,3],
'YEAR_MONTH':[201301,201301,201301,201302,201302,201302,201302],
'PRODUCT_CODE': [100,150,220,400,50,80,100]
}
table = pd.DataFrame(data)
table
CLIENT_CODE YEAR_MONTH PRODUCT_CODE
0 1 201301 100
1 1 201301 150
2 2 201301 220
3 1 201302 400
4 2 201302 50
5 2 201302 80
6 3 201302 100
Now, list the columns of interest and use groupby in a slightly modified syntax:
columns = ['YEAR_MONTH', 'PRODUCT_CODE']
table[columns].groupby(table['CLIENT_CODE']).nunique()
We obtain:
YEAR_MONTH PRODUCT_CODE CLIENT_CODE
1 2 3
2 2 3
3 1 1
Upvotes: 4
Reputation: 4632
With the new Pandas version, it is easy to get as a data frame:
unique_count = pd.groupby(['YEARMONTH'], as_index=False).agg(uniq_CLIENTCODE=('CLIENTCODE', pd.Series.count))
Upvotes: 1
Reputation: 323226
Using crosstab
, this will return more information than groupby
nunique
:
pd.crosstab(df.YEARMONTH,df.CLIENTCODE)
Out[196]:
CLIENTCODE 1 2 3
YEARMONTH
201301 2 1 0
201302 1 2 1
After a little bit of modification, it yields the result:
pd.crosstab(df.YEARMONTH,df.CLIENTCODE).ne(0).sum(1)
Out[197]:
YEARMONTH
201301 2
201302 3
dtype: int64
Upvotes: 5
Reputation: 2665
Here is another method and it is much simpler. Let’s say your dataframe name is daat
and the column name is YEARMONTH
:
daat.YEARMONTH.value_counts()
Upvotes: 134
Reputation: 619
To get the distinct number of values for any column (CLIENTCODE
in your case), we can use nunique
. We can pass the input as a dictionary in agg
function, along with aggregations on other columns:
grp_df = df.groupby('YEARMONTH').agg({'CLIENTCODE': ['nunique'],
'other_col_1': ['sum', 'count']})
# to flatten the multi-level columns
grp_df.columns = ["_".join(col).strip() for col in grp_df.columns.values]
# if you wish to reset the index
grp_df.reset_index(inplace=True)
Upvotes: 18
Reputation: 4562
I am also using nunique
but it will be very helpful if you have to use an aggregate function like 'min', 'max', 'count' or 'mean'
etc.
df.groupby('YEARMONTH')['CLIENTCODE'].transform('nunique') #count(distinct)
df.groupby('YEARMONTH')['CLIENTCODE'].transform('min') #min
df.groupby('YEARMONTH')['CLIENTCODE'].transform('max') #max
df.groupby('YEARMONTH')['CLIENTCODE'].transform('mean') #average
df.groupby('YEARMONTH')['CLIENTCODE'].transform('count') #count
Upvotes: 20
Reputation: 2735
Interestingly enough, very often len(unique())
is a few times (3x-15x) faster than nunique()
.
Upvotes: 56
Reputation: 35235
I believe this is what you want:
table.groupby('YEARMONTH').CLIENTCODE.nunique()
Example:
In [2]: table
Out[2]:
CLIENTCODE YEARMONTH
0 1 201301
1 1 201301
2 2 201301
3 1 201302
4 2 201302
5 2 201302
6 3 201302
In [3]: table.groupby('YEARMONTH').CLIENTCODE.nunique()
Out[3]:
YEARMONTH
201301 2
201302 3
Upvotes: 598