Reputation: 137
Given a dataframe like this:
   count        date location  type
0    100  2018-01-01    site1  high
1     10  2018-01-01    site2   low
2     11  2018-01-01    site3   low
3    101  2018-01-03    site2  high
4    103  2018-01-03    site2  high
5     15  2018-01-03    site3   low
I need to find the highest and lowest count for each day (in mm-dd format, year is irrelevant). The result I'm looking for is something like this:
                 count        date location
month-day type
01-01     high     100  2018-01-01    site1
          low       10  2018-01-01    site2
01-03     high     103  2018-01-03    site2
          low       15  2018-01-03    site3
I have a way that works, but I'm sure it can be cleaned up. Here's what I've got currently:
import pandas as pd

df = pd.DataFrame({'date':['2018-01-01', '2018-01-01', '2018-01-01', '2018-01-03', '2018-01-03', '2018-01-03'],
                   'location':['site1', 'site2', 'site3', 'site2', 'site2', 'site3'],
                   'type':['high', 'low', 'low', 'high', 'high', 'low'],
                   'count':[100, 10, 11, 101, 103, 15]})
df['date'] = pd.to_datetime(df['date'])
df['month-day'] = df['date'].apply(lambda x: x.strftime('%m-%d'))
maxCount = df.loc[df.groupby(['month-day']['type'=='high'])['count'].idxmax()]
minCount = df.loc[df.groupby(['month-day']['type'=='low'])['count'].idxmin()]
df = maxCount.merge(minCount, how='outer')
df.set_index(['month-day', 'type'], inplace=True)
df.sort_index(inplace=True)
These will eventually be used as input to matplotlib to chart count vs. month-day for both low and high, so it may actually make sense to keep them separate rather than joining them back together, but is there a better way to do this? The groupby in particular seems kind of iffy with the ][, but it does work. The only things that I care about are month-day, type, and count (and type is only needed to know if it's low or high, so if I used a dedicated series for low and one for high I wouldn't need to keep type once I put month-day and count in the appropriate series).
Upvotes: 0
Views: 1063
Reputation: 25239
You may try agg, stack, loc and set_index:
s = pd.to_datetime(df.date).dt.strftime('%m-%d')
m = df.groupby(s)['count'].agg(['idxmax', 'idxmin']).stack()
df_out = df.loc[m].set_index([m.index.droplevel(1), 'type'])
Out[127]:
                  date location  count
date  type
01-01 high  2018-01-01    site1    100
      low   2018-01-01    site2     10
01-03 high  2018-01-03    site2    103
      low   2018-01-03    site3     15
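To make the intermediate steps easier to follow, here is a minimal sketch that rebuilds the question's sample data and inspects what agg and stack produce before the final loc. The names labels and picked are only illustrative, and note that this approach takes the overall max and min per day without looking at type.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'date': ['2018-01-01', '2018-01-01', '2018-01-01',
             '2018-01-03', '2018-01-03', '2018-01-03'],
    'location': ['site1', 'site2', 'site3', 'site2', 'site2', 'site3'],
    'type': ['high', 'low', 'low', 'high', 'high', 'low'],
    'count': [100, 10, 11, 101, 103, 15],
})

# Month-day key kept as a standalone Series; it is not added to df.
s = pd.to_datetime(df['date']).dt.strftime('%m-%d')

# One row per day, holding the row label of the largest and smallest count.
labels = df.groupby(s)['count'].agg(['idxmax', 'idxmin'])

# stack() turns the two columns into one long Series of row labels.
m = labels.stack()

# Select the matching rows by label (illustrative name).
picked = df.loc[m.to_numpy()]
print(labels)
print(picked)
```

Because idxmax and idxmin return row labels, the selected rows keep every original column, including location.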
Upvotes: 0
Reputation: 14093
You were not really clear on the logic: should type be included? Based on what you have tried, I am going to assume yes:
# group the counts by day
group = df.groupby('month-day')['count']
# boolean masks: the day's minimum labelled low, the day's maximum labelled high
# (string names avoid the warning newer pandas raises for the builtins min/max)
min_log = ((df['count'] == group.transform('min')) & (df['type'] == 'low'))
max_log = ((df['count'] == group.transform('max')) & (df['type'] == 'high'))
# boolean indexing to filter df
df[min_log | max_log]
        date location  type  count month-day
0 2018-01-01    site1  high    100     01-01
1 2018-01-01    site2   low     10     01-01
4 2018-01-03    site2  high    103     01-03
5 2018-01-03    site3   low     15     01-03
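On the ][ in the question's groupby: the expression ['month-day']['type'=='high'] compares two string literals, which is False, and indexing a list with False is the same as indexing it with 0. So it evaluates to plain 'month-day', and type never takes part in the grouping. If type should be honored, one option is to filter by type first and then group. This is only a sketch under that assumption; key, high_rows and low_rows are illustrative names.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'date': ['2018-01-01', '2018-01-01', '2018-01-01',
             '2018-01-03', '2018-01-03', '2018-01-03'],
    'location': ['site1', 'site2', 'site3', 'site2', 'site2', 'site3'],
    'type': ['high', 'low', 'low', 'high', 'high', 'low'],
    'count': [100, 10, 11, 101, 103, 15],
})
df['month-day'] = pd.to_datetime(df['date']).dt.strftime('%m-%d')

# 'type' == 'high' is False, and a list indexed with False returns element 0.
key = ['month-day']['type' == 'high']
print(key)  # month-day

# Filter by type first, then take the extreme row within each day.
high = df[df['type'] == 'high']
low = df[df['type'] == 'low']
high_rows = high.loc[high.groupby('month-day')['count'].idxmax()]
low_rows = low.loc[low.groupby('month-day')['count'].idxmin()]
print(high_rows)
print(low_rows)
```

On the sample data this picks the same rows as the original code, but it would differ if a day's largest count were labelled low.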
Upvotes: 0
Reputation: 2049
What you want to do is complicated by the fact that you have already assigned highs and lows. Do you need to account for these? (Is one day's max labelled as a low?)
If not, you can go with something as simple as this:
df.groupby(['month-day']).agg({ 'count': ['min', 'max'] })
Which would give you this:
          count
            min  max
month-day
01-01        10  100
01-03        15  103
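Since the question mentions wanting a dedicated series for low and one for high, here is a small variation as a sketch: selecting the count column before aggregating gives flat column names, so each column can be pulled out as its own series. It assumes type can be ignored, and the names summary, low and high are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'date': ['2018-01-01', '2018-01-01', '2018-01-01',
             '2018-01-03', '2018-01-03', '2018-01-03'],
    'location': ['site1', 'site2', 'site3', 'site2', 'site2', 'site3'],
    'type': ['high', 'low', 'low', 'high', 'high', 'low'],
    'count': [100, 10, 11, 101, 103, 15],
})
df['month-day'] = pd.to_datetime(df['date']).dt.strftime('%m-%d')

# Selecting 'count' first yields plain 'min' and 'max' columns
# instead of a two-level column index.
summary = df.groupby('month-day')['count'].agg(['min', 'max'])

# One series per line to chart, both indexed by month-day.
low = summary['min']
high = summary['max']
print(low)
print(high)
```

Both series share the month-day index, so something like plt.plot(high.index, high.values) should be enough for charting, assuming matplotlib.pyplot is imported as plt.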
Upvotes: 1