Reputation: 3502
I would like to get the distinct count of values in a python pandas dataframe and write the result to a new column. This is what I have so far.
import pandas as pd
df = pd.DataFrame( {
'OrderNo': [1,1,1,1,2,2,2,3,3],
'Barcode': [1234,2345,3456,3456,1234,1234,2345,1234,3456]
} );
df['barcodeCountPerOrderNo'] = df.groupby(['OrderNo', 'Barcode'])['Barcode'].transform('count')
df['distinctBarcodesPerOrderNo'] = '?'
print df
This gives:
Barcode OrderNo barcodeCountPerOrderNo distinctBarcodesPerOrder
0 1234 1 1 ?
1 2345 1 1 ?
2 3456 1 2 ?
3 3456 1 2 ?
4 1234 2 2 ?
5 1234 2 2 ?
6 2345 2 1 ?
7 1234 3 1 ?
8 3456 3 1 ?
But how can I get the distinctBarcodesPerOrder?
Barcode OrderNo barcodeCountPerOrderNo distinctBarcodesPerOrder
0 1234 1 1 3
1 2345 1 1 3
2 3456 1 2 3
3 3456 1 2 3
4 1234 2 2 2
5 1234 2 2 2
6 2345 2 1 2
7 1234 3 1 2
8 3456 3 1 2
Upvotes: 2
Views: 5074
Reputation: 19947
#If you want a one-liner, you can use apply to get the distinctBarcodesPerOrder for each row. Although this method might be a slow on large dataset.
df['distinctBarcodesPerOrder'] = df.apply(lambda x: df.loc[df.OrderNo==x.OrderNo,'Barcode'].nunique(), axis=1)
df
Out[237]:
Barcode OrderNo barcodeCountPerOrderNo distinctBarcodesPerOrder
0 1234 1 1 3
1 2345 1 1 3
2 3456 1 2 3
3 3456 1 2 3
4 1234 2 2 2
5 1234 2 2 2
6 2345 2 1 2
7 1234 3 1 2
8 3456 3 1 2
Upvotes: 1
Reputation: 21552
I would use map
to get unique values and directly merge them into the original dataframe:
df['distinctBarcodesPerOrder'] = df['OrderNo'].map(df.groupby('OrderNo')['Barcode'].nunique())
which returns:
Barcode OrderNo barcodeCountPerOrderNo distinctBarcodesPerOrder
0 1234 1 1 3
1 2345 1 1 3
2 3456 1 2 3
3 3456 1 2 3
4 1234 2 2 2
5 1234 2 2 2
6 2345 2 1 2
7 1234 3 1 2
8 3456 3 1 2
Upvotes: 1
Reputation: 2749
You can use nunique
to calculate the number of unique barcodes per order
Barcode_distinct = df.groupby('OrderNo')['Barcode'].nunique()
the result is pandas Series
> OrderNo
> 1 3
> 2 2
> 3 2
> Name: Barcode, dtype: int64
then you merge this with the original DataFrame
df.merge(Barcode_distinct.to_frame(), left_on='OrderNo', right_index=True, suffixes=('', '_unique_per_OrderNo'))
the results is
> Barcode OrderNo Barcode_unique_per_OrderNo
> 0 1234 1 3
> 1 2345 1 3
> 2 3456 1 3
> 3 3456 1 3
> 4 1234 2 2
> 5 1234 2 2
> 6 2345 2 2
> 7 1234 3 2
> 8 3456 3 2
Upvotes: 3