Reputation: 101
Say I have a data frame that records a customer (denoted by a letter) and the date that they arrived in a store.
customer date
A 2010-01-01
B 2010-01-01
A 2010-01-02
C 2010-01-02
D 2010-01-03
D 2010-01-03
I want to count the number of unique customers that come in to the store and display the cumulative amount of unique customers each day. So I don't want to double count a customer that's already been in the store in the past.
The desired output would be something like this:
date cumulative unique customers to date
2010-01-01 2 # A, B
2010-01-02 3 # A, B, C
2010-01-03 4 # A, B, C, D
So after 3 days, we have 4 distinct customers that have arrived
So far, the closest I've come is using the groupby()
function:
df.groupby('Date')['Customer'].nunique()
which gives me
date unique customers per date
2010-01-01 2 # A, B
2010-01-02 2 # A, C
2010-01-03 1 # D
However, this is just the number of unique customers per day.
I'm thinking I need to create a new column and a function that goes through each row and places a 1 into the new column if it's the first visit of the customer in that row, 0 otherwise. Then apply the cumsum() method on this new column (for each day) to get the desired outcome.
I'm having problems coming up with code that takes a value in the 'customer' column and compares it to every other value in the column to determine if it's the first of its kind - or is this even necessary?
Upvotes: 2
Views: 822
Reputation: 8302
try this, drop_duplicates
along with groupby
>>> df.drop_duplicates(["customer"]).groupby("date")['customer'].nunique().cumsum()
date
2010-01-01 2
2010-01-02 3
2010-01-03 4
Name: customer, dtype: int64
Upvotes: 3
Reputation: 59579
Use crosstab
to convert this into an existence DataFrame for each customer on every date. With cummax
we can fill forward all dates since they appear and this becomes a simple sum
to get unique customers.
pd.crosstab(df.date, df.customer).clip(upper=1)
#customer A B C D
#date
#2010-01-01 1 1 0 0
#2010-01-02 1 0 1 0
#2010-01-03 0 0 0 1
pd.crosstab(df.date, df.customer).clip(upper=1).cummax()
#customer A B C D
#date
#2010-01-01 1 1 0 0
#2010-01-02 1 1 1 0
#2010-01-03 1 1 1 1
pd.crosstab(df.date, df.customer).clip(upper=1).cummax().sum(1)
#date
#2010-01-01 2
#2010-01-02 3
#2010-01-03 4
#dtype: int64
Upvotes: 0