Reputation: 13
I am reading a csv file using pandas.
my dataframe consist of 3.7 million records and has two column: Date, Subscribers_ID
my dataframe data is the list of active subscribers per day.
I want to check what subscribers_id exist in day X and does not exist in day X + 1 so i can have a list of the subscribers_ID that are not inactive in day X + 1. And i want to do that for each day of the existing days.
is there any comparative function that do this or i have to create a new dataframe for each day and compare dataframes to each others. Because i have more than 75 days.
here is a sample of my data and what i want as result:
import pandas as pd
data = {'date':['22-Jan-22', '22-Jan-22', '22-Jan-22', '22-Jan-22', '23-Jan-22', '23-Jan-22', '23-Jan-22', '23-Jan-22', '23-Jan-22', '23-Jan-22', '23-Jan-22', '24-Jan-22', '24-Jan-22', '24-Jan-22', '24-Jan-22', '24-Jan-22', '24-Jan-22'], 'Subscriber_ID':['a', 'b', 'c', 'd', 'e', 'f', 'b', 'c', 'd', 'h', 'g', 'c', 'd', 'h', 'j', 'i', 'k']}
df = pd.DataFrame(data)
print(df)
I want to have the following result:
-Subscribers_ID lost in 23-Jan-22 is/are: a
-Subscribers_ID lost in 24-Jan-22 is/are: e, f, b, g
Upvotes: 1
Views: 202
Reputation: 261850
A simple idea would be to groupby
date and compute a set
of the users. Then compute the reverse diff
and shift
. This will give you the users that left.
(df.groupby('date')
['Subscriber_ID']
.apply(set)
.diff(-1)
.shift()
)
Output:
date
22-Jan-22 NaN
23-Jan-22 {a}
24-Jan-22 {g, b, e, f}
Name: Subscriber_ID, dtype: object
NB. Conveniently, if you do the same with a forward diff
, this will give you the new users:
(df.groupby('date')
['Subscriber_ID']
.apply(set)
.diff()
)
Output:
date
22-Jan-22 NaN
23-Jan-22 {g, h, e, f}
24-Jan-22 {i, k, j}
Name: Subscriber_ID, dtype: object
Upvotes: 1