Johnny
Johnny

Reputation: 69

How to find difference in months between 2 columns and save it in a new column?

I have a big data frame (the fragment is below):

                              start_date                         finish_date
2842    2019-02-16 19:35:55.125766+00:00    2019-06-23 08:10:42.867492+00:00
2844    2019-05-29 18:03:54.230822+00:00    2019-06-05 08:06:37.896891+00:00
2846    2019-03-26 10:29:14.626280+00:00    2019-03-28 03:00:12.350836+00:00
2847    2019-04-22 16:29:30.480639+00:00    2019-04-24 18:02:09.869749+00:00
2852    2019-06-28 11:32:32.104132+00:00    2019-07-07 20:15:47.000026+00:00
2853    2019-03-21 17:20:50.030024+00:00    2019-03-27 03:18:26.652882+00:00
2854    2019-07-12 13:46:24.119986+00:00    2019-09-16 14:36:16.995393+00:00

start_date and finish_date are datetime64 format.

I need to create a new column with the result of calculation of how many months between start_date and finish_date.

for each string I used

len(pd.date_range(start=df.loc[2844, 'start_date'], end=df.loc[2844, 'finish_date'], freq='M'))

But I dont know how to apply this to every row ... row by row. I guess some lambda must be used...

This:

df['length'] = pd.date_range(start=df['start_date'], end=df['finish_date'], freq='M')

rises an error...

expected result:

                              start_date                         finish_date    length
2842    2019-02-16 19:35:55.125766+00:00    2019-06-23 08:10:42.867492+00:00    4
2844    2019-05-29 18:03:54.230822+00:00    2019-06-05 08:06:37.896891+00:00    1
2846    2019-03-26 10:29:14.626280+00:00    2019-03-28 03:00:12.350836+00:00    0
2847    2019-04-22 16:29:30.480639+00:00    2019-04-24 18:02:09.869749+00:00    0
2852    2019-06-28 11:32:32.104132+00:00    2019-07-07 20:15:47.000026+00:00    1
2853    2019-03-21 17:20:50.030024+00:00    2019-03-27 03:18:26.652882+00:00    0
2854    2019-07-12 13:46:24.119986+00:00    2019-09-16 14:36:16.995393+00:00    2

Upvotes: 1

Views: 368

Answers (3)

Anurag Dabas
Anurag Dabas

Reputation: 24304

Since both dates are of dtype datetime you can calculate the difference between months by using Series.dt.month attribute:

df['length']=(df['finish_date'].dt.month-df['start_date'].dt.month).abs()

Upvotes: 2

Abhi
Abhi

Reputation: 1005

Finding a Difference between dates for month can cause issue with rounding so I have given both the result for your uderstanding:

import pandas as pd


data = {
    "start_date" :["2019-02-16 19:35:55.125766+00:00", "2019-05-29 18:03:54.230822+00:00", "2019-03-26 10:29:14.626280+00:00", "2019-04-22 16:29:30.480639+00:00", "2019-06-28 11:32:32.104132+00:00", "2019-03-21 17:20:50.030024+00:00", "2019-07-12 13:46:24.119986+00:00"],
    "finish_date" : ["2019-06-23 08:10:42.867492+00:00", "2019-06-05 08:06:37.896891+00:00", "2019-03-28 03:00:12.350836+00:00", "2019-04-24 18:02:09.869749+00:00", "2019-07-07 20:15:47.000026+00:00", "2019-03-27 03:18:26.652882+00:00", "2019-09-16 14:36:16.995393+00:00"]
}

df = pd.DataFrame(data)

# you can skip this if already in datetime format
df['start_date'] = pd.to_datetime(df['start_date'])
df['finish_date'] = pd.to_datetime(df['finish_date'])


df["months"] = df.finish_date.dt.to_period('M').astype(int) - df.start_date.dt.to_period('M').astype(int)
df["months_no_rounding"] = df.finish_date.dt.to_period('M') - df.start_date.dt.to_period('M')
print(df)

Result:

enter image description here

Upvotes: 2

mozway
mozway

Reputation: 260335

Be careful as your difference has issues with rounding! For instance, the difference between 2019-05-29 and 2019-06-05 is only 6 days, but you calculate it as 1 month. This might be wanted… or not!

Here is an alternative to approach a real difference in months (an absolute difference in months is not possible as months are variable, from 29 to 31):

df['diff'] = (df['finish_date']-df['start_date']).dt.days//30

output:

                     start_date                finish_date  diff
2842 2019-02-16 19:35:55.125766 2019-06-23 08:10:42.867492     4
2844 2019-05-29 18:03:54.230822 2019-06-05 08:06:37.896891     0
2846 2019-03-26 10:29:14.626280 2019-03-28 03:00:12.350836     0
2847 2019-04-22 16:29:30.480639 2019-04-24 18:02:09.869749     0
2852 2019-06-28 11:32:32.104132 2019-07-07 20:15:47.000026     0
2853 2019-03-21 17:20:50.030024 2019-03-27 03:18:26.652882     0
2854 2019-07-12 13:46:24.119986 2019-09-16 14:36:16.995393     2

and as float: df['diff'] = (df['finish_date']-df['start_date']).dt.days/30

                     start_date                finish_date      diff
2842 2019-02-16 19:35:55.125766 2019-06-23 08:10:42.867492  4.200000
2844 2019-05-29 18:03:54.230822 2019-06-05 08:06:37.896891  0.200000
2846 2019-03-26 10:29:14.626280 2019-03-28 03:00:12.350836  0.033333
2847 2019-04-22 16:29:30.480639 2019-04-24 18:02:09.869749  0.066667
2852 2019-06-28 11:32:32.104132 2019-07-07 20:15:47.000026  0.300000
2853 2019-03-21 17:20:50.030024 2019-03-27 03:18:26.652882  0.166667
2854 2019-07-12 13:46:24.119986 2019-09-16 14:36:16.995393  2.200000

Upvotes: 0

Related Questions