Jabb
Jabb

Reputation: 3502

How to get the distinct count of values in a python pandas dataframe

I would like to get the distinct count of values in a python pandas dataframe and write the result to a new column. This is what I have so far.

import pandas as pd

df = pd.DataFrame( {
   'OrderNo': [1,1,1,1,2,2,2,3,3],
   'Barcode': [1234,2345,3456,3456,1234,1234,2345,1234,3456]
    } );

df['barcodeCountPerOrderNo'] = df.groupby(['OrderNo', 'Barcode'])['Barcode'].transform('count')

df['distinctBarcodesPerOrderNo'] = '?'

print df

This gives:

   Barcode  OrderNo  barcodeCountPerOrderNo distinctBarcodesPerOrder
0     1234        1                       1                       ?
1     2345        1                       1                       ?
2     3456        1                       2                       ?
3     3456        1                       2                       ?
4     1234        2                       2                       ?
5     1234        2                       2                       ?
6     2345        2                       1                       ?
7     1234        3                       1                       ?
8     3456        3                       1                       ?

But how can I get the distinctBarcodesPerOrder?

   Barcode  OrderNo  barcodeCountPerOrderNo distinctBarcodesPerOrder
0     1234        1                       1                       3
1     2345        1                       1                       3
2     3456        1                       2                       3
3     3456        1                       2                       3
4     1234        2                       2                       2
5     1234        2                       2                       2
6     2345        2                       1                       2
7     1234        3                       1                       2
8     3456        3                       1                       2

Upvotes: 2

Views: 5074

Answers (3)

Allen Qin
Allen Qin

Reputation: 19947

#If you want a one-liner, you can use apply to get the distinctBarcodesPerOrder for each row. Although this method might be a slow on large dataset. 

df['distinctBarcodesPerOrder'] = df.apply(lambda x: df.loc[df.OrderNo==x.OrderNo,'Barcode'].nunique(), axis=1)

df
Out[237]: 
   Barcode  OrderNo  barcodeCountPerOrderNo  distinctBarcodesPerOrder
0     1234        1                       1                         3
1     2345        1                       1                         3
2     3456        1                       2                         3
3     3456        1                       2                         3
4     1234        2                       2                         2
5     1234        2                       2                         2
6     2345        2                       1                         2
7     1234        3                       1                         2
8     3456        3                       1                         2

Upvotes: 1

Fabio Lamanna
Fabio Lamanna

Reputation: 21552

I would use map to get unique values and directly merge them into the original dataframe:

df['distinctBarcodesPerOrder'] = df['OrderNo'].map(df.groupby('OrderNo')['Barcode'].nunique())

which returns:

   Barcode  OrderNo  barcodeCountPerOrderNo  distinctBarcodesPerOrder
0     1234        1                       1                         3
1     2345        1                       1                         3
2     3456        1                       2                         3
3     3456        1                       2                         3
4     1234        2                       2                         2
5     1234        2                       2                         2
6     2345        2                       1                         2
7     1234        3                       1                         2
8     3456        3                       1                         2

Upvotes: 1

lanenok
lanenok

Reputation: 2749

You can use nunique to calculate the number of unique barcodes per order

Barcode_distinct = df.groupby('OrderNo')['Barcode'].nunique()

the result is pandas Series

> OrderNo
> 1    3
> 2    2
> 3    2
> Name: Barcode, dtype: int64

then you merge this with the original DataFrame

df.merge(Barcode_distinct.to_frame(), left_on='OrderNo', right_index=True, suffixes=('', '_unique_per_OrderNo'))

the results is

>    Barcode  OrderNo  Barcode_unique_per_OrderNo
> 0     1234        1                           3
> 1     2345        1                           3
> 2     3456        1                           3
> 3     3456        1                           3
> 4     1234        2                           2
> 5     1234        2                           2
> 6     2345        2                           2
> 7     1234        3                           2
> 8     3456        3                           2

Upvotes: 3

Related Questions