flurble
flurble

Reputation: 1106

Python / Pandas - Datetime statistics. How to aggregate means of datetime columns

i am currently writing a "Split - Apply - Combine" pipeline for my data analysis, which also involves dates. Here's some sample data:

In [1]:
import pandas as pd
import numpy as np
import datetime as dt

startdate = np.datetime64("2018-01-01")
randdates = np.random.randint(1, 365, 100) + startdate

df = pd.DataFrame({'Type': np.random.choice(['A', 'B', 'C'], 100), 
                   'Metric': np.random.rand(100), 
                   'Date': randdates})
df.head()

Out[1]: 
  Type    Metric       Date
0    A  0.442970 2018-08-02
1    A  0.611648 2018-02-11
2    B  0.202763 2018-03-16
3    A  0.295577 2018-01-09
4    A  0.895391 2018-11-11

Now I want to aggregate by 'Type' and get summary statistics for the respective variables. This is easy for numerical variables like 'Metric':

df.groupby('Type')['Metric'].agg(('mean', 'std'))

For datetime objects however, calculating a mean, standard deviation, or other statistics doesn't really make sense and throws an error. The context I need this operation for, is that I am modelling a Date based on some distance metric. When I repeat this modelling with random sampling (monte-carlo simulation), I later want to reassign a mean and confidence interval to the modeled dates.

So my Question is: What useful statistics can be built with datetime data? How do you represent the statistical distribution of modelled dates? And how do you implement the aggregation operation?

My Ideal output would be to get a Date_mean and Date_stdev column representing a range for my modeled dates.

Upvotes: 2

Views: 2055

Answers (2)

David Makovoz
David Makovoz

Reputation: 1918

You can compute min, max, and mean using the built-in operations of the datetime:

    date = dt.datetime.date
    df.groupby('Type')['Date'].agg(lambda x:(date(x.mean()), date(x.min()), date(x.max())))

    Out[490]:
    Type
    A    (2018-06-10, 2018-01-11, 2018-11-08)
    B    (2018-05-20, 2018-01-20, 2018-12-31)
    C    (2018-06-22, 2018-01-04, 2018-12-05)
Name: Date, dtype: object

I used date(x) to make sure the output fits here, it's not really needed.

Upvotes: 0

prosti
prosti

Reputation: 46351

You can use timestamps (Unix)

Epoch, also known as Unix timestamps, is the number of seconds (not milliseconds!) that have elapsed since January 1, 1970 at 00:00:00 GMT (1970-01-01 00:00:00 GMT).

You can convert all your dates to timestamps liks this:

import time
import datetime
d = "2018-08-02"
time.mktime(datetime.datetime.strptime(d, "%Y-%m-%d").timetuple()) #1533160800

And from there you can calculate what you need.

Upvotes: 1

Related Questions