rkx74656

Reputation: 55

How do I calculate a time difference in days or months in Python 3?

I've been working on a scraping and EDA project in Python 3 using pandas, BeautifulSoup, and a few other libraries, and I want to do some analysis using the time difference between two dates. I want to determine the number of days (or months, or even years if that makes it easier) between the start and end dates, and I'm stuck. I have two columns (air start date, air end date) with dates in the format MM-YYYY (for example, 01-2021). I want to add a third column holding the time difference between the end and start dates so I can use it in later analysis.

import datetime

import numpy as npy
import pandas as pds

# split air_dates column into start and end date

dateList = df["air_dates"].str.split("-", n = 1, expand = True)
df['air_start_date'] = dateList[0]
df['air_end_date'] = dateList[1]
df.drop(columns = ['air_dates'], inplace = True)
df.drop(columns = ['rank'], inplace = True)

# changing dates to numerical notation
df['air_start_date'] = pds.to_datetime(df['air_start_date'])
df['air_start_date'] = df['air_start_date'].dt.date.apply(lambda x: x.strftime('%m-%Y') if pds.notnull(x) else npy.NaN)

df['air_end_date'] = pds.Series(df['air_end_date'])
df['air_end_date'] = pds.to_datetime(df['air_end_date'], errors = 'coerce')
df['air_end_date'] = df['air_end_date'].dt.date.apply(lambda x: x.strftime('%m-%Y') if pds.notnull(x) else npy.NaN)

df.isnull().sum()
df.dropna(subset = ['air_end_date'], inplace = True)

def time_diff(time_series):
    return datetime.datetime.strptime(time_series, '%d')
df['time difference'] = df['air_end_date'].apply(time_diff) - df['air_start_date'].apply(time_diff)

The last three lines are my attempt at getting a time difference, but I got an error saying 'ValueError: unconverted data remains: -2021'. Any help would be greatly appreciated, as this has had me stuck for a good while now. Thank you!

Upvotes: 1

Views: 186

Answers (3)

FObersteiner

Reputation: 25544

Here's a slightly cleaned-up version: subtract the start date from the end date to get a timedelta, then take the days attribute from that.

Example:

import pandas as pd

df = pd.DataFrame({'air_dates': ["Apr 2009 - Jul 2010", "not a date - also not a date"]})

df['air_start_date'] = df['air_dates'].str.split(" - ", expand=True)[0]
df['air_end_date'] = df['air_dates'].str.split(" - ", expand=True)[1]

df['air_start_date'] = pd.to_datetime(df['air_start_date'], errors="coerce")
df['air_end_date'] = pd.to_datetime(df['air_end_date'], errors="coerce")

df['timediff_days'] = (df['air_end_date']-df['air_start_date']).dt.days

For the dummy example, that gives:

df['timediff_days']
0    456.0
1      NaN
Name: timediff_days, dtype: float64

For the difference in months, you can find some suggestions on how to calculate it here. I'd go with @piRSquared's approach:

df['timediff_months'] = ((df['air_end_date'].dt.year - df['air_start_date'].dt.year) * 12 +
                         (df['air_end_date'].dt.month - df['air_start_date'].dt.month))

df['timediff_months']
0    15.0
1     NaN
Name: timediff_months, dtype: float64
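
The same year-and-month arithmetic also works outside pandas. A minimal sketch on plain datetime objects, where months_between is a hypothetical helper name and not part of any library; like the expression above, it ignores the day of the month:

```python
from datetime import datetime


def months_between(start: datetime, end: datetime) -> int:
    """Count calendar-month steps from start to end (hypothetical helper)."""
    # Each whole year contributes 12 months; the month fields supply the rest.
    return (end.year - start.year) * 12 + (end.month - start.month)


start = datetime(2009, 4, 1)
end = datetime(2010, 7, 1)
print(months_between(start, end))  # 15
```

Because only the year and month fields are compared, 31 January to 1 February counts as one month. That is probably fine for MM-YYYY data, where every date lands on the first of the month.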

Upvotes: 0

rkx74656

Reputation: 55

OK, so I figured it out. In my second-to-last line, I replaced the %d with %m-%Y, and now it populates the new column with the number of days between the two dates. I think the format string has to match the whole date string when running strptime, and that mismatch is what was causing the error.
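
To illustrate why that change works, here is a minimal standard-library sketch, assuming date strings in the MM-YYYY format from the question. days_between and DATE_FORMAT are hypothetical names used only for this example. strptime requires the format to cover the entire string, so '%d' consumes only the leading 01 and then fails on the leftover -2021, while '%m-%Y' parses the whole value (the day defaults to 1):

```python
from datetime import datetime

DATE_FORMAT = "%m-%Y"  # assumed input format, e.g. "01-2021"


def days_between(start: str, end: str) -> int:
    """Return the number of days from start to end (hypothetical helper)."""
    start_dt = datetime.strptime(start, DATE_FORMAT)
    end_dt = datetime.strptime(end, DATE_FORMAT)
    # Subtracting two datetimes yields a timedelta; .days is its whole-day part.
    return (end_dt - start_dt).days


print(days_between("01-2021", "03-2021"))  # 59

# The original format string fails because it leaves part of the input unparsed.
try:
    datetime.strptime("01-2021", "%d")
except ValueError as exc:
    print(exc)  # unconverted data remains: -2021
```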

Upvotes: 0

Achyut

Reputation: 67

As far as I understand, if you have a start date and time and an end date and time, you can use the datetime module in Python.

For example:

from datetime import datetime

# datetime(year, month, day, hour, minute, second)

start = datetime(2017, 5, 8, 18, 56, 40)

end = datetime(2019, 6, 27, 12, 30, 58)

print(end - start)   # prints the time elapsed between the two datetimes as a timedelta
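
Subtracting two datetime objects gives a timedelta. A minimal sketch of pulling numbers out of it, reusing the same two example timestamps; the variable name delta is only illustrative:

```python
from datetime import datetime

start = datetime(2017, 5, 8, 18, 56, 40)
end = datetime(2019, 6, 27, 12, 30, 58)

delta = end - start            # later minus earlier gives a positive timedelta
print(delta)                   # 779 days, 17:34:18
print(delta.days)              # 779 (whole days only)
print(delta.total_seconds())   # 67368858.0
```

Swapping the operands (start - end) gives a negative timedelta, so the order matters if the sign is used later.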

Hope this answer helps you.

Upvotes: 2
