Reputation: 87
I have generated a table that shows the mode values of my dataset. The dataset was originally grouped by "date", "hour" and "room" in order to be able to get the mode value of foot traffic. The groupby was performed as follows:
dataframe = df.groupby([df['date'], df['hour'], df['room']])
Then I generated the mode value(s) for "traffic" of each group the following way:
dataframe = dataframe['traffic'].apply(lambda x: x.mode())
As a result I have a dataframe which displays the proper groups and shows the modal value per room, per hour and per day. My issue is that in certain cases there is more than one modal value (because 2 or 3 values have the same number of observations).
The current dataframe looks like this:
                       mode
date hour room
6    12   room1 0        15
          room2 0        23
                1        26
          room3 0         1
                1         2
     13   room2 0         9
                1        11
          room2 0        15
As you can see above, for room2 at 12:00 on January 6 there are 2 modal values (23 and 26).
Ideally, I would drop the lowest "mode" value(s) from each group that has more than one modal value.
I have looked at several approaches but cannot get this to function. I was thinking that the following would work:
dataframe.apply(lambda x: x[dataframe['mode'] == dataframe['mode'].max()])
I would then remove duplicates, but this does not affect the dataframe.
or
dataframe.filter(lambda x : x[dataframe['mode'] == dataframe['mode'].max()], dataframe['mode'])
which gives me a "'function' object is not iterable" error
or
for elem in range(0, dataframe.size - 1):  # to iterate over the dataframe rows
    if elem != dataframe['mode'].max():  # to identify rows that aren't the max mode value
        dataframe = dataframe.drop([elem])  # to drop these rows
To answer the request from Conner, please see below the original csv data (dataframe called "df"):
    room   time                 capacity  areaName  hour  date  traffic
0   room1  Mon Nov 02 09:00:00  40        area01    9     2     14
1   room1  Mon Nov 02 09:05:00  40        area01    9     2     15
2   room1  Mon Nov 02 09:10:00  80        area01    9     2     23
3   room1  Mon Nov 02 09:15:00  80        area01    9     2     23
...
14  room2  Mon Nov 02 11:00:00  40        area03    11    2     67
15  room2  Mon Nov 02 11:50:00  80        area03    11    2     64
16  room2  Mon Nov 02 11:10:00  40        area03    11    2     72
If anyone knows a way to go through each group and keep only the max mode value when there are several, I would greatly appreciate it.
Thank you for your time!
-Romain
Upvotes: 2
Views: 897
Reputation: 7275
I was looking for something like this. FYI you can get this with df.head(n=10).to_csv(path, index=False)
room,time,capacity,areaName,hour,date,traffic
room1,Mon Nov 02 09:00:00,40,area01,9,2,14
room1,Mon Nov 02 09:05:00,40,area01,9,2,15
room1,Mon Nov 02 09:10:00,80,area01,9,2,23
room1,Mon Nov 02 09:15:00,80,area01,9,2,23
room2,Mon Nov 02 11:00:00,40,area03,11,2,67
room2,Mon Nov 02 11:50:00,80,area03,11,2,64
room2,Mon Nov 02 11:10:00,40,area03,11,2,72
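For anyone who wants to reproduce this, here is a minimal sketch of loading that sample back into a DataFrame. It assumes the rows are pasted into a string (the name csv_text is just for illustration) rather than read from a file:

```python
import io

import pandas as pd

# The sample rows from above, pasted as a string (csv_text is an illustrative name).
csv_text = """room,time,capacity,areaName,hour,date,traffic
room1,Mon Nov 02 09:00:00,40,area01,9,2,14
room1,Mon Nov 02 09:05:00,40,area01,9,2,15
room1,Mon Nov 02 09:10:00,80,area01,9,2,23
room1,Mon Nov 02 09:15:00,80,area01,9,2,23
room2,Mon Nov 02 11:00:00,40,area03,11,2,67
room2,Mon Nov 02 11:50:00,80,area03,11,2,64
room2,Mon Nov 02 11:10:00,40,area03,11,2,72
"""

# read_csv accepts any file-like object, so StringIO stands in for a real file.
df = pd.read_csv(io.StringIO(csv_text))
```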
(Below I use equivalent code to be more concise)
This gives you a groupby object:
dataframe = df.groupby(['date', 'hour', 'room'])
It turns out, unlike mean, max, median, min and mad, there is no mode method for GroupBy objects!
Once you've done this
df = dataframe['traffic'].apply(lambda x: x.mode())
you can reset the index and regroup to apply the max per group:
df = df.reset_index()
df = df.groupby(['date', 'hour', 'room']).max()
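As a possible shortcut, the two steps can probably be collapsed into one by taking the max of the modes inside the aggregation itself. This is only a sketch: the sample rows below are made up for illustration (they are not the original data), and the name highest_mode is an assumption.

```python
import pandas as pd

# Made-up sample: room2 has two modal values (23 and 26), room1 has one (15).
df = pd.DataFrame({
    "date": [6, 6, 6, 6, 6, 6, 6, 6],
    "hour": [12, 12, 12, 12, 12, 12, 12, 12],
    "room": ["room1", "room1", "room1",
             "room2", "room2", "room2", "room2", "room2"],
    "traffic": [15, 15, 14, 23, 23, 26, 26, 10],
})

# Series.mode() returns every tied modal value; .max() keeps only the highest,
# so each group collapses to a single number.
highest_mode = (
    df.groupby(["date", "hour", "room"])["traffic"]
      .agg(lambda x: x.mode().max())
)

print(highest_mode)
```

The result is a Series indexed by (date, hour, room) with one value per group, so there is no extra index level to reset afterwards.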
Upvotes: 2