anto5
anto5

Reputation: 13

Pandas dataframe comparing with no index

I am reading a csv file using pandas.

my dataframe consist of 3.7 million records and has two column: Date, Subscribers_ID

my dataframe data is the list of active subscribers per day.

I want to check what subscribers_id exist in day X and does not exist in day X + 1 so i can have a list of the subscribers_ID that are not inactive in day X + 1. And i want to do that for each day of the existing days.

is there any comparative function that do this or i have to create a new dataframe for each day and compare dataframes to each others. Because i have more than 75 days.

here is a sample of my data and what i want as result:

import pandas as pd
 
data = {'date':['22-Jan-22', '22-Jan-22', '22-Jan-22', '22-Jan-22', '23-Jan-22', '23-Jan-22', '23-Jan-22', '23-Jan-22', '23-Jan-22', '23-Jan-22', '23-Jan-22', '24-Jan-22', '24-Jan-22', '24-Jan-22', '24-Jan-22', '24-Jan-22', '24-Jan-22'], 'Subscriber_ID':['a', 'b', 'c', 'd', 'e', 'f', 'b', 'c', 'd', 'h', 'g', 'c', 'd', 'h', 'j', 'i', 'k']}
 
df = pd.DataFrame(data)
 
print(df)

I want to have the following result:

-Subscribers_ID lost in 23-Jan-22 is/are: a

-Subscribers_ID lost in 24-Jan-22 is/are: e, f, b, g

Upvotes: 1

Views: 202

Answers (1)

mozway
mozway

Reputation: 261850

A simple idea would be to groupby date and compute a set of the users. Then compute the reverse diff and shift. This will give you the users that left.

(df.groupby('date')
   ['Subscriber_ID']
   .apply(set)
   .diff(-1)
   .shift()
 )

Output:

date
22-Jan-22             NaN
23-Jan-22             {a}
24-Jan-22    {g, b, e, f}
Name: Subscriber_ID, dtype: object

NB. Conveniently, if you do the same with a forward diff, this will give you the new users:

(df.groupby('date')
   ['Subscriber_ID']
   .apply(set)
   .diff()
 )

Output:

date
22-Jan-22             NaN
23-Jan-22    {g, h, e, f}
24-Jan-22       {i, k, j}
Name: Subscriber_ID, dtype: object

Upvotes: 1

Related Questions