jml

Reputation: 137

pandas selecting max and min simultaneously

Given a dataframe like this:

   count        date location  type
0    100  2018-01-01    site1  high
1     10  2018-01-01    site2   low
2     11  2018-01-01    site3   low
3    101  2018-01-03    site2  high
4    103  2018-01-03    site2  high
5     15  2018-01-03    site3   low

I need to find the highest and lowest count for each day (in mm-dd format, year is irrelevant). The result I'm looking for is something like this:

                count       date location
month-day type                           
01-01     high    100 2018-01-01    site1
          low      10 2018-01-01    site2
01-03     high    103 2018-01-03    site2
          low      15 2018-01-03    site3

I have a way that works, but I'm sure it can be cleaned up. Here's what I've got currently:

import pandas as pd

df = pd.DataFrame({'date':['2018-01-01', '2018-01-01', '2018-01-01', '2018-01-03', '2018-01-03', '2018-01-03'],
                   'location':['site1', 'site2', 'site3', 'site2', 'site2', 'site3'], 
                   'type':['high', 'low', 'low', 'high', 'high', 'low'], 
                   'count':[100, 10, 11, 101, 103, 15]})

df['date'] = pd.to_datetime(df['date'])
df['month-day'] = df['date'].apply(lambda x: x.strftime('%m-%d')) 

maxCount = df.loc[df.groupby(['month-day']['type'=='high'])['count'].idxmax()]
minCount = df.loc[df.groupby(['month-day']['type'=='low'])['count'].idxmin()]

df = maxCount.merge(minCount, how='outer')
df.set_index(['month-day', 'type'], inplace=True)
df.sort_index(inplace=True)

These will eventually be used as input to matplotlib to chart count vs. month-day for both low and high, so it may actually make sense to keep them separate rather than joining them back together. Is there a better way to do this? The groupby in particular seems kind of iffy with the ][, but it does work. The only things I care about are month-day, type, and count. Type is only needed to tell low from high, so if I used one dedicated series for low and one for high, I wouldn't need to keep type once month-day and count are in the appropriate series.
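On the ][ part: as far as I can tell, 'type'=='high' compares two string literals and is always False, so ['month-day'][False] is just ['month-day'][0], and the groupby only ever groups on month-day; the type column is never consulted. It returns the right rows here only because each day's maximum happens to be a high and each minimum a low. A minimal sketch of filtering on type first and keeping two separate series (the names key, high and low are just illustrative) might look like this:

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2018-01-01', '2018-01-01', '2018-01-01',
             '2018-01-03', '2018-01-03', '2018-01-03'],
    'location': ['site1', 'site2', 'site3', 'site2', 'site2', 'site3'],
    'type': ['high', 'low', 'low', 'high', 'high', 'low'],
    'count': [100, 10, 11, 101, 103, 15],
})
df['month-day'] = pd.to_datetime(df['date']).dt.strftime('%m-%d')

# 'type' == 'high' compares two string literals, so it is False, and
# ['month-day'][False] is ['month-day'][0]: the key is just 'month-day'.
key = ['month-day']['type' == 'high']

# Filter on the type column first, then aggregate per month-day.
high = df[df['type'] == 'high'].groupby('month-day')['count'].max()
low = df[df['type'] == 'low'].groupby('month-day')['count'].min()
```

Both high and low are indexed by month-day, so type no longer needs to be carried along.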

Upvotes: 0

Views: 1063

Answers (3)

Andy L.

Reputation: 25239

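You may try agg, stack, loc and set_index:

# month-day key as an 'mm-dd' string
s = pd.to_datetime(df.date).dt.strftime('%m-%d')
# row labels of the max and min count per day, stacked into one Series
m = df.groupby(s)['count'].agg(['idxmax', 'idxmin']).stack()
# select those rows, then index them by month-day and type
df_out = df.loc[m].set_index([m.index.droplevel(1), 'type'])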

Out[127]:
                  date location  count
date  type
01-01 high  2018-01-01    site1    100
      low   2018-01-01    site2     10
01-03 high  2018-01-03    site2    103
      low   2018-01-03    site3     15

Upvotes: 0

It_is_Chris

Reputation: 14093

You were not really clear on the logic: should type be included? Based on what you have tried I am going to assume yes:

# groupby
group = df.groupby('month-day')['count']

# create your min and max logic for boolean indexing
min_log = ((df['count'] == group.transform('min')) & (df['type'] == 'low'))
max_log = ((df['count'] == group.transform('max')) & (df['type'] == 'high'))

# boolean indexing to filter df
df[min_log | max_log]

        date location  type  count month-day
0 2018-01-01    site1  high    100     01-01
1 2018-01-01    site2   low     10     01-01
4 2018-01-03    site2  high    103     01-03
5 2018-01-03    site3   low     15     01-03

Upvotes: 0

Dave

Reputation: 2049

What you want to do is complicated by the fact that you have already assigned highs and lows. Do you need to account for these? (Is one day's max labelled as a low?) If not, you can go with something as simple as this:

df.groupby(['month-day']).agg({'count': ['min', 'max']})

Which would give you this:

          count     
            min  max
month-day           
01-01        10  100
01-03        15  103
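
If the end goal is plotting, flat column names may be easier to work with than the two-level header above. Here is a minimal sketch using named aggregation (available in pandas 0.25 and later). The column names low and high and the variable summary are my own choice, and like the snippet above it ignores the existing type column:

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2018-01-01', '2018-01-01', '2018-01-01',
             '2018-01-03', '2018-01-03', '2018-01-03'],
    'location': ['site1', 'site2', 'site3', 'site2', 'site2', 'site3'],
    'type': ['high', 'low', 'low', 'high', 'high', 'low'],
    'count': [100, 10, 11, 101, 103, 15],
})
df['month-day'] = pd.to_datetime(df['date']).dt.strftime('%m-%d')

# Named aggregation: one flat column per statistic, indexed by month-day.
summary = df.groupby('month-day')['count'].agg(low='min', high='max')
print(summary)
```

summary['low'] and summary['high'] can then be handed to matplotlib as two separate series.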

Upvotes: 1
