RomainD

Reputation: 87

Getting the maximum mode per group using groupby

I have generated a table that shows the mode values of my dataset. The dataset was originally grouped by "date", "hour" and "room" in order to be able to get the mode value of foot traffic. The groupby was performed as follows:

dataframe = df.groupby([df['date'], df['hour'], df['room']])

Then I generated the mode value(s) of "traffic" for each group as follows:

dataframe = dataframe['traffic'].apply(lambda x: x.mode())

As a result, my dataframe displays the proper groups and shows the modal value per room, per hour and per day. My issue is that in certain cases there is more than one modal value (because 2 or 3 values had the same number of observations).
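To illustrate why several rows can appear for one group, here is a minimal sketch. The readings are made up for illustration and are not taken from the original data. It shows that Series.mode() returns every value tied for the highest count, and that calling .max() on that result keeps only the largest one.

```python
import pandas as pd

# Hypothetical traffic readings for a single group (not from the original data)
traffic = pd.Series([23, 23, 26, 26, 30])

# Series.mode() returns every value tied for the highest count, sorted ascending
modes = traffic.mode()
print(modes.tolist())

# Taking .max() of that result keeps only the largest of the tied modes
largest_mode = modes.max()
print(largest_mode)
```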

The current dataframe looks like this:

                            mode
date    hour    room        
6       12      room1   0   15  
                room2   0   23   
                        1   26  
                room3   0   1  
                        1   2
        13      room2   0   9
                        1   11
                room2   0   15

As you can see above, for room2 at 12:00 on January 6 there are 2 modal values (23 and 26).
Ideally, I would drop the lowest "mode" value(s) from each group that has more than one modal value, keeping only the highest.

I have looked at several approaches but cannot get this to function. I was thinking that the following would work:

dataframe.apply(lambda x: x[dataframe['mode'] == dataframe['mode'].max()])

I would then remove duplicates, but this does not affect the dataframe.

or

dataframe.filter(lambda x : x[dataframe['mode'] == dataframe['mode'].max()], dataframe['mode'])

which gives me a "'function' object is not iterable" error,

or

for elem in range(0, dataframe.size -1):      #to iterate over the dataframe rows
    if elem != dataframe['mode'].max():       #to identify rows that aren't max mode value
        dataframe = dataframe.drop([elem])    #to drop these rows

To answer the request from Conner, please see below the original csv data (dataframe called "df"):

        room    time                 capacity   areaName    hour    date    traffic 
    0   room1   Mon Nov 02 09:00:00  40         area01      9       2       14
    1   room1   Mon Nov 02 09:05:00  40         area01      9       2       15
    2   room1   Mon Nov 02 09:10:00  80         area01      9       2       23  
    3   room1   Mon Nov 02 09:15:00  80         area01      9       2       23  
...
    14  room2   Mon Nov 02 11:00:00  40         area03      11      2       67  
    15  room2   Mon Nov 02 11:50:00  80         area03      11      2       64
    16  room2   Mon Nov 02 11:10:00  40         area03      11      2       72  

If anyone knows a way to go through each group and keep only the max mode value when there are several, I would greatly appreciate it.

Thank you for your time!

-Romain

Upvotes: 2

Views: 897

Answers (1)

conner.xyz

Reputation: 7275

I was looking for something like this. FYI you can get this with df.head(n=10).to_csv(path, index=False)

room,time,capacity,areaName,hour,date,traffic 
room1,Mon Nov 02 09:00:00,40,area01,9,2,14
room1,Mon Nov 02 09:05:00,40,area01,9,2,15
room1,Mon Nov 02 09:10:00,80,area01,9,2,23  
room1,Mon Nov 02 09:15:00,80,area01,9,2,23  
room2,Mon Nov 02 11:00:00,40,area03,11,2,67  
room2,Mon Nov 02 11:50:00,80,area03,11,2,64
room2,Mon Nov 02 11:10:00,40,area03,11,2,72 

(Below I use equivalent code to be more concise)

This gives you a groupby object

df = df.groupby(['date', 'hour', 'room'])

It turns out that, unlike mean, max, median, min and mad, there is no mode method for GroupBy objects!

Once you've done this

df = df['traffic'].apply(lambda x: x.mode())

You can reset the index and regroup to apply the max per group:

df = df.reset_index()
df = df.groupby(['date', 'hour', 'room']).max()
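Putting the steps above together, a self-contained sketch could look like the following. The sample rows are invented for illustration (only the column names come from the question), and the names keys, modes, max_mode and max_mode_direct are hypothetical. Selecting the "traffic" column before calling max() avoids also aggregating the extra index-level column that reset_index() creates. The last line shows a possible one-step alternative, assuming you only need the largest mode per group.

```python
import pandas as pd

# Small made-up sample using the question's column names (values are illustrative)
df = pd.DataFrame({
    "date":    [6, 6, 6, 6, 6, 6, 6],
    "hour":    [12, 12, 12, 12, 12, 12, 12],
    "room":    ["room1", "room1", "room2", "room2", "room2", "room2", "room1"],
    "traffic": [15, 15, 23, 23, 26, 26, 14],
})

keys = ["date", "hour", "room"]

# Step 1: all modal values per group (tied modes yield several rows per group)
modes = df.groupby(keys)["traffic"].apply(lambda x: x.mode())

# Step 2: flatten the index, regroup, and keep the largest mode in each group
max_mode = modes.reset_index().groupby(keys)["traffic"].max()

# Alternative: compute the largest mode per group in a single aggregation
max_mode_direct = df.groupby(keys)["traffic"].agg(lambda x: x.mode().max())

print(max_mode)
```

Here room2 has two tied modes (23 and 26), and only 26 survives in both max_mode and max_mode_direct.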

Upvotes: 2
