Reputation: 1106
i am currently writing a "Split - Apply - Combine" pipeline for my data analysis, which also involves dates. Here's some sample data:
In [1]:
import pandas as pd
import numpy as np
import datetime as dt
startdate = np.datetime64("2018-01-01")
randdates = np.random.randint(1, 365, 100) + startdate
df = pd.DataFrame({'Type': np.random.choice(['A', 'B', 'C'], 100),
'Metric': np.random.rand(100),
'Date': randdates})
df.head()
Out[1]:
Type Metric Date
0 A 0.442970 2018-08-02
1 A 0.611648 2018-02-11
2 B 0.202763 2018-03-16
3 A 0.295577 2018-01-09
4 A 0.895391 2018-11-11
Now I want to aggregate by 'Type'
and get summary statistics for the respective variables. This is easy for numerical variables like 'Metric'
:
df.groupby('Type')['Metric'].agg(('mean', 'std'))
For datetime objects however, calculating a mean, standard deviation, or other statistics doesn't really make sense and throws an error. The context I need this operation for, is that I am modelling a Date based on some distance metric. When I repeat this modelling with random sampling (monte-carlo simulation), I later want to reassign a mean and confidence interval to the modeled dates.
So my Question is: What useful statistics can be built with datetime data? How do you represent the statistical distribution of modelled dates? And how do you implement the aggregation operation?
My Ideal output would be to get a Date_mean
and Date_stdev
column representing a range for my modeled dates.
Upvotes: 2
Views: 2055
Reputation: 1918
You can compute min, max, and mean using the built-in operations of the datetime:
date = dt.datetime.date
df.groupby('Type')['Date'].agg(lambda x:(date(x.mean()), date(x.min()), date(x.max())))
Out[490]:
Type
A (2018-06-10, 2018-01-11, 2018-11-08)
B (2018-05-20, 2018-01-20, 2018-12-31)
C (2018-06-22, 2018-01-04, 2018-12-05)
Name: Date, dtype: object
I used date(x) to make sure the output fits here, it's not really needed.
Upvotes: 0
Reputation: 46351
You can use timestamps (Unix)
Epoch, also known as Unix timestamps, is the number of seconds (not milliseconds!) that have elapsed since January 1, 1970 at 00:00:00 GMT (1970-01-01 00:00:00 GMT).
You can convert all your dates to timestamps liks this:
import time
import datetime
d = "2018-08-02"
time.mktime(datetime.datetime.strptime(d, "%Y-%m-%d").timetuple()) #1533160800
And from there you can calculate what you need.
Upvotes: 1