Reputation: 25
I am trying to do some time trend analysis of cyclone data but am having difficulties getting the pd dataframe into a format where I can do that...
Right now the dataframe looks like this:
date | season | stormid | |
---|---|---|---|
0 | 1970-01-04 12:00:00 | 1970 | SH071970 |
1 | 1970-01-04 18:00:00 | 1970 | SH071970 |
2 | 1970-01-05 00:00:00 | 1970 | SH071970 |
3 | 1970-01-05 06:00:00 | 1970 | SH071970 |
4 | 1970-01-05 12:00:00 | 1970 | SH071970 |
5 | 1970-01-05 18:00:00 | 1970 | SH071970 |
6 | 1970-01-06 00:00:00 | 1970 | SH071970 |
7 | 1970-01-06 06:00:00 | 1970 | SH071970 |
8 | 1970-01-06 12:00:00 | 1970 | SH071970 |
9 | 1970-01-06 18:00:00 | 1970 | SH071970 |
10 | 1970-01-07 00:00:00 | 1970 | SH071970 |
And I would like to have all the entries across days somehow combined into a total day count, for example like this:
start date | end date | days | season | stormid | |
---|---|---|---|---|---|
0 | 1970-01-04 | 1970-01-11 | 15 | 1970 | SH071870 |
So on and so forth for all the different stormids.
Upvotes: 1
Views: 77
Reputation: 261685
If you want to get the number of days between the min and max per group, use numpy.ptp
and Timedelta.days
:
import numpy as np
df['date'] = pd.to_datetime(df['date'])
df['date'] = pd.to_datetime(df['date'])
out = (df.groupby(['season', 'stormid'], as_index=False)
.agg(days=('date', lambda s: np.ptp(s).days),
start_date=('date', lambda s: s.min().normalize()),
end_date=('date', lambda s: s.max().normalize())
)
)
NB. lambda s: np.ptp(s).days
is equivalent to lambda s: (s.max()-s.min()).days
? If you want to round up use lambda s: (s.max()-s.min()).ceil('D').days
(getting 3 days here).
Output:
season stormid days start_date end_date
0 1970 SH071970 2 1970-01-04 1970-01-07
df['date'] = pd.to_datetime(df['date'])
out = (df.groupby(['season', 'stormid'], as_index=False)
.agg(start_date=('date', lambda s: s.min().normalize()),
end_date=('date', lambda s: s.max().normalize())
)
.assign(days=lambda d: d['end_date'].sub(d['start_date']).dt.days)
)
Output:
season stormid start_date end_date days
0 1970 SH071970 1970-01-04 1970-01-07 3
If you simply want the number of rows per group (which is not really "days"):
out = (df.groupby(['season', 'stormid'], as_index=False)
.agg(days=('date', 'size'))
)
Output:
season stormid days
0 1970 SH071970 11
Upvotes: 1