pkfly103

Reputation: 101

How to count the number of unique elements in a column in a Pandas dataframe

Say I have a data frame that records a customer (denoted by a letter) and the date that they arrived in a store.

    customer       date
       A        2010-01-01
       B        2010-01-01
       A        2010-01-02
       C        2010-01-02
       D        2010-01-03
       D        2010-01-03

I want to count the number of unique customers that come into the store and display the cumulative number of unique customers each day. A customer who has already been in the store on an earlier day should not be counted again.

The desired output would be something like this:

     date         cumulative unique customers to date
  2010-01-01                     2   # A, B
  2010-01-02                     3   # A, B, C
  2010-01-03                     4   # A, B, C, D

So after 3 days, 4 distinct customers have arrived.

So far, the closest I've come is using the groupby() function:

df.groupby('date')['customer'].nunique(), which gives me:

     date                 unique customers per date
  2010-01-01                     2   # A, B
  2010-01-02                     2   # A, C
  2010-01-03                     1   # D

However, this is just the number of unique customers on each day, not the running total.

I'm thinking I need to create a new column, with a function that goes through each row and puts a 1 in the new column if that row is the customer's first visit and 0 otherwise. I could then sum this column for each day and apply cumsum() to the daily totals to get the desired outcome.

I'm having problems coming up with code that takes a value in the 'customer' column and compares it to every other value in the column to determine if it's the first of its kind - or is this even necessary?
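
The first-visit flag described above does not need a row-by-row comparison. A minimal sketch, assuming the column names `customer` and `date` from the sample table and dates stored as ISO strings (which sort chronologically); the `first_visit` column name is made up for illustration:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "customer": ["A", "B", "A", "C", "D", "D"],
    "date": ["2010-01-01", "2010-01-01", "2010-01-02",
             "2010-01-02", "2010-01-03", "2010-01-03"],
})

# duplicated() marks every repeat of a value after its first occurrence,
# so the rows must be in date order for "first" to mean "earliest".
df = df.sort_values("date", kind="stable")

# True on the row where a customer first appears, False on repeat visits.
# "first_visit" is a hypothetical column name.
df["first_visit"] = ~df["customer"].duplicated()

# New customers per day, then a running total across days.
cumulative = df.groupby("date")["first_visit"].sum().cumsum()
print(cumulative)  # 2, 3, 4 for the three dates
```

Because every date stays in the groupby, a day with only returning customers would contribute 0 and still appear in the result.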

Upvotes: 2

Views: 822

Answers (2)

sushanth

Reputation: 8302

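Try this: use drop_duplicates to keep only the first row for each customer, then group by date and take the cumulative sum of the counts. This assumes the rows are already sorted by date.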

>>> df.drop_duplicates(["customer"]).groupby("date")['customer'].nunique().cumsum()
date
2010-01-01    2
2010-01-02    3
2010-01-03    4
Name: customer, dtype: int64
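
One thing to watch with this approach: a date on which only returning customers appear has no rows left after drop_duplicates, so it disappears from the result. A sketch of one possible workaround, assuming the same column names; the extra 2010-01-04 row is hypothetical and added only to show the effect:

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["A", "B", "A", "C", "D", "D", "A"],
    "date": ["2010-01-01", "2010-01-01", "2010-01-02", "2010-01-02",
             "2010-01-03", "2010-01-03", "2010-01-04"],  # last row is hypothetical
})

# Sort so that drop_duplicates keeps each customer's earliest visit.
df = df.sort_values("date", kind="stable")

firsts = df.drop_duplicates("customer")
result = firsts.groupby("date")["customer"].nunique().cumsum()
# At this point 2010-01-04 is missing, because A was already seen.

# Restore every date from the original frame and carry the total forward.
all_dates = pd.Index(df["date"].unique(), name="date")
result = result.reindex(all_dates).ffill().astype(int)
print(result)  # 2, 3, 4, 4
```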

Upvotes: 3

ALollz

Reputation: 59579

Use crosstab to build a date-by-customer table, clipped to 1, that records whether each customer appeared on each date. cummax then carries each customer's 1 forward to every date after their first appearance, so a row-wise sum gives the number of unique customers to date.

pd.crosstab(df.date, df.customer).clip(upper=1)
#customer    A  B  C  D
#date                  
#2010-01-01  1  1  0  0
#2010-01-02  1  0  1  0
#2010-01-03  0  0  0  1

pd.crosstab(df.date, df.customer).clip(upper=1).cummax()
#customer    A  B  C  D
#date                  
#2010-01-01  1  1  0  0
#2010-01-02  1  1  1  0
#2010-01-03  1  1  1  1

pd.crosstab(df.date, df.customer).clip(upper=1).cummax().sum(1)
#date
#2010-01-01    2
#2010-01-02    3
#2010-01-03    4
#dtype: int64
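
For reference, a self-contained version of the same steps with each intermediate named. It is only a sketch: the variable names are illustrative, and the extra 2010-01-04 row is hypothetical, added to show that a day with no new customers still gets a row because crosstab keeps every date.

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["A", "B", "A", "C", "D", "D", "A"],
    "date": ["2010-01-01", "2010-01-01", "2010-01-02", "2010-01-02",
             "2010-01-03", "2010-01-03", "2010-01-04"],  # last row is hypothetical
})

# 1 if the customer visited on that date, 0 otherwise
# (clip turns repeat visits on the same day, such as D's, into 1).
presence = pd.crosstab(df["date"], df["customer"]).clip(upper=1)

# Once a customer has appeared, keep them at 1 on all later dates.
seen_so_far = presence.cummax()

# Row-wise sum: number of distinct customers seen up to each date.
cumulative = seen_so_far.sum(axis=1)
print(cumulative)  # 2, 3, 4, 4
```

The table has one column per customer, so memory grows with dates × customers; for a large customer base the drop_duplicates route is likely lighter.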

Upvotes: 0
