Chrestomanci
Chrestomanci

Reputation: 221

Create a column to denote # of occurrence based on ID, and take difference of dates based on ID

I have a CSV dataset of roughly 250k rows that is stored as a dataframe using pandas.

Each row is a record of a client coming in. A client could come in multiple times, which would result in multiple records being created. Some clients have only ever come in once, other clients come in dozens of times.

The CSV dataset has many columns that I use for other purposes, but the ones that this specific problem uses include:

CLIENT_ID   |   DATE_ARRIVED  
0001            1/01/2010
0002            1/02/2010
0001            2/01/2010
0001            2/22/2010
0002            4/01/2010

....

I am trying to create a new column that would assign a number denoting what # occurrence the row is based on the ID. Then if there is a # occurrence >1, have it take the difference of the dates in days from the prior occurrence.

Important note: The dataset is not ordered, so the script has to be able to determine which is the first based on the earliest date. If the client came in multiple times in the day, it would look at which is the earliest time within the date.

I tried to create a set using the CLIENT_ID, then looping through each element in the set to get the count. This gives me the total count, but I can't figure out how to get it to create a new column with those incrementally increasing counts.

I haven't gotten far enough to the DATE_ARRIVED differences based on # occurrence.

Nothing viable, hoping to get some ideas! If there is an easier way to determine differences between two dates next to each other for a client, I'm also open to ideas! I have a way of doing this manually through Excel, which involves:

  1. ordering the dataset by ID and date,
  2. checking each to see if the ID before was equal (and if it is, increment by 1)
  3. creating a new column that takes the difference of the above only if the previous number was >1 ... but I have no idea how to do this in Python.

The output should look something like:

CLIENT_ID   |   DATE_ARRIVED    |  OCCURRENCE    |  DAYS_SINCE_LAST
0001            1/01/2019          1                N/A
0002            1/02/2019          1                N/A
0001            2/01/2019          2                31
0001            2/22/2010          3                21
0002            4/01/2010          2                90

Upvotes: 0

Views: 95

Answers (1)

BENY
BENY

Reputation: 323236

Using groupby with transform count + diff

df['OCCURRENCE']=df.groupby('CLIENT_ID').CLIENT_ID.transform('count')
df['DAYS_SINCE_LAST']=df.groupby('CLIENT_ID')['DATE_ARRIVED'].diff().dt.days
df
Out[45]: 
   CLIENT_ID DATE_ARRIVED  OCCURRENCE  DAYS_SINCE_LAST
0          1   2010-01-01           2              NaN
1          2   2010-01-02           1              NaN
2          1   2010-02-01           2             31.0

Upvotes: 1

Related Questions