4212extra
4212extra

Reputation: 81

Date difference from a list in pandas dataframe

I have a pandas dataframe for text data. I created by doing group by and aggregate to get the texts per id like below. I later calculated the word count.

df = df.groupby('id') \
         .agg({'chat': ', '.join }) \
         .reset_index()

It looks like this: chat is the collection of the text data per id. The created_at is the dates of chats, converted to string type.

|id|chat      |word count|created_at                                                 |
|23|hi,hey!,hi|3         |2018-11-09 02:11:24,2018-11-09 02:11:43,2018-11-09 03:13:22|
|24|look there|2         |2017-11-03 18:05:34,2017-11-06 18:03:22                    |
|25|thank you!|2         |2017-11-07 09:18:01,2017-11-18 11:09:37                    |

I want to change add a chat duration column that gives the difference between first date and last date in days as integer.If chat ends same day then 1. The new expected column is :-

|chat_duration|
|1            |
|3            |
|11           |

Copying to clipboard looks like this before the group by

 ,id,chat,created_at
0,23,"hi",2018-11-09 02:11:24
1,23,"hey!",2018-11-09 02:11:43
2,23,"hi",2018-11-09 03:13:22

Upvotes: 0

Views: 128

Answers (3)

Trenton McKinney
Trenton McKinney

Reputation: 62543

If I were doing the entire process

  1. Beginning with the unprocessed data
id,chat,created_at
23,"hi i'm at school",2018-11-09 02:11:24
23,"hey! how are you",2018-11-09 02:11:43
23,"hi mom",2018-11-09 03:13:22
24,"leaving home",2018-11-09 02:11:24
24,"not today",2018-11-09 02:11:43
24,"i'll be back",2018-11-10 03:13:22
25,"yesterday i had",2018-11-09 02:11:24
25,"it's to hot",2018-11-09 02:11:43
25,"see you later",2018-11-12 03:13:22

# create the dataframe with this data on the clipboard
df = pd.read_clipboard(sep=',')
  1. set created_at to datetime
df.created_at = pd.to_datetime(df.created_at)
  1. create word_count
df['word_count'] = df.chat.str.split(' ').map(len)
  1. groupby agg to get all chat as a string, created_at as a list, and word_cound as a total sum.
df = df.groupby('id').agg({'chat': ','.join , 'created_at': list, 'word_count': sum}).reset_index()
  1. calculate chat_duration
df['chat_duration'] = df['created_at'].apply(lambda x: (max(x) - min(x)).days)
  1. convert created_at to desired string format
    • If you skip this step, created_at will be a list of datetimes.
df['created_at'] = df['created_at'].apply(lambda x: ','.join([y.strftime("%m/%d/%Y %H:%M:%S") for y in x]))

Final df

|    |   id | chat                                      | created_at                                                  |   word_count |   chat_duration |
|---:|-----:|:------------------------------------------|:------------------------------------------------------------|-------------:|----------------:|
|  0 |   23 | hi i'm at school,hey! how are you,hi mom  | 11/09/2018 02:11:24,11/09/2018 02:11:43,11/09/2018 03:13:22 |           10 |               0 |
|  1 |   24 | leaving home,not today,i'll be back       | 11/09/2018 02:11:24,11/09/2018 02:11:43,11/10/2018 03:13:22 |            7 |               1 |
|  2 |   25 | yesterday i had,it's to hot,see you later | 11/09/2018 02:11:24,11/09/2018 02:11:43,11/12/2018 03:13:22 |            9 |               3 |

Upvotes: 1

jezrael
jezrael

Reputation: 863701

Create DataFrame by split and then subtract first and last columns converted to datetimes:

df1 = df['created_at'].str.split(',', expand=True).ffill(axis=1)
df['created_at'] = (pd.to_datetime(df1.iloc[:, -1]) - pd.to_datetime(df1.iloc[:, 0])).dt.days

Upvotes: 0

4212extra
4212extra

Reputation: 81

After some tries I got it:

First convert string to list.

df['created_at'] = df['created_at'].str.split(
    ',').apply(lambda s: list(s))

Then subtract max and min date item by converting to list

df['created_at'] = df['created_at'].apply(lambda s: (datetime.strptime(
    str(max(s)), '%Y-%m-%d') - datetime.strptime(str(min(s)), '%Y-%m-%d') ).days)

Upvotes: 0

Related Questions