Reputation: 221
I have a CSV dataset of roughly 250k rows that is stored as a dataframe using pandas.
Each row is a record of a client coming in. A client could come in multiple times, which would result in multiple records being created. Some clients have only ever come in once, other clients come in dozens of times.
The CSV dataset has many columns that I use for other purposes, but the ones that this specific problem uses include:
CLIENT_ID | DATE_ARRIVED
0001 1/01/2010
0002 1/02/2010
0001 2/01/2010
0001 2/22/2010
0002 4/01/2010
....
I am trying to create a new column that would assign a number denoting what # occurrence the row is based on the ID. Then if there is a # occurrence >1, have it take the difference of the dates in days from the prior occurrence.
Important note: The dataset is not ordered, so the script has to be able to determine which is the first based on the earliest date. If the client came in multiple times in the day, it would look at which is the earliest time within the date.
I tried to create a set using the CLIENT_ID, then looping through each element in the set to get the count. This gives me the total count, but I can't figure out how to get it to create a new column with those incrementally increasing counts.
I haven't gotten far enough to the DATE_ARRIVED differences based on # occurrence.
Nothing viable, hoping to get some ideas! If there is an easier way to determine differences between two dates next to each other for a client, I'm also open to ideas! I have a way of doing this manually through Excel, which involves:
The output should look something like:
CLIENT_ID | DATE_ARRIVED | OCCURRENCE | DAYS_SINCE_LAST
0001 1/01/2019 1 N/A
0002 1/02/2019 1 N/A
0001 2/01/2019 2 31
0001 2/22/2010 3 21
0002 4/01/2010 2 90
Upvotes: 0
Views: 95
Reputation: 323236
Using groupby
with transform
count
+ diff
df['OCCURRENCE']=df.groupby('CLIENT_ID').CLIENT_ID.transform('count')
df['DAYS_SINCE_LAST']=df.groupby('CLIENT_ID')['DATE_ARRIVED'].diff().dt.days
df
Out[45]:
CLIENT_ID DATE_ARRIVED OCCURRENCE DAYS_SINCE_LAST
0 1 2010-01-01 2 NaN
1 2 2010-01-02 1 NaN
2 1 2010-02-01 2 31.0
Upvotes: 1