MHanu
MHanu

Reputation: 335

Filter out top and bottom values in a dataframe

I have this dataframe:

import pandas as pd
df = pd.DataFrame({'Flight Day': ['2018-10-01', '2018-10-02','2018-10-03', '2018-10-04', '2018-10-05','2018-10-06', '2018-10-07', '2018-10-08', '2018-10-09','2018-10-10','2018-10-11','2018-10-12'], 
               'Flight Number': ['CA1336', 'CA1332', 'CA1336', 'CA1473', 'CA1336', 'CA1331', 'CA1666', 'CA1336', 'CA1336', 'CA1336', 'CA1336', 'CA1667'],
               'STD Departure': [10, 15, 10, 15,10, 15, 15, 15,10, 10, 10, 11], 
               'Sandwich 1': [2, 4, 8, 4,3, 2, 3, 1,5, 5, 2, 1],
               'Sandwich 2': [2, 4, 8, 4,2, 2, 3, 4,2, 5, 2, 1]})

I first would like to keep only the 5 most recent days for each flight with the specific departure time. I used this formula so far:

df = df.groupby(['Flight Number','STD Departure']).tail(5)

Then I would like to delete the flights with the highest consumption (Sandwich + plus Sandwic 2) and the lowest consumption, again grouped first by Flight Number and STD Departure.

I tried this code but it did not bring the desired results:

FF = ["Sandwich 1", "Sandwich 2"]
df ["sum"] = df[FF].sum(axis=1)
df = df.groupby(['Flight Number','STD Departure', 'sum']).head(4)
df = df.groupby(['Flight Number','STD Departure', 'sum']).tail(3)

Any idea how I could get to my desired outcome:

Flight Day Flight Number  STD Departure  Sandwich 1  Sandwich 2  sum
1   2018-10-02        CA1332             15           4           4    8
3   2018-10-04        CA1473             15           4           4    8
4   2018-10-05        CA1336             10           3           2    5
5   2018-10-06        CA1331             15           2           2    4
6   2018-10-07        CA1666             15           3           3    6
7   2018-10-08        CA1336             15           1           4    5
8   2018-10-09        CA1336             10           5           2    7
9   2018-10-10        CA1336             10           5           5   10
11  2018-10-12        CA1667             11           1           1    2

These lines were delted in the last step:

10  2018-10-11        CA1336             10           2           2    4
2   2018-10-03        CA1336             10           8           8   16

Upvotes: 0

Views: 606

Answers (2)

MHanu
MHanu

Reputation: 335

I figured it out on my own, I had forgotten to sort first.

df= df.sort_values (by = ["Flight Day", "Flight Number",'STD Departure'])
df = df.groupby(['Flight Number','STD Departure']).tail(5)
FF = ["Sandwich 1", "Sandwich 2"]
df ["sum"] = df[FF].sum(axis=1)
df= df.sort_values (by = ["Flight Number", "STD Departure", "sum"])
df = df.groupby(['Flight Number','STD Departure']).tail(4)
df = df.groupby(['Flight Number','STD Departure']).head(3)

Upvotes: 0

jezrael
jezrael

Reputation: 862681

I believe you need remove top and bottow row only if length of groups is more like 2 rows - first sorting by 3 columns and then remove per groups with iloc and if-else statement:

FF = ["Sandwich 1", "Sandwich 2"]
df["sum"] = df[FF].sum(axis=1)

df = df.groupby(['Flight Number','STD Departure']).head(5)
df = (df.sort_values(['Flight Number','sum','Flight Day'], ascending=[False, False, True])
        .groupby(['Flight Number','STD Departure'], group_keys=False)
        .apply(lambda x: x.iloc[1:-1] if len(x) > 2 else x)
        .sort_index()
        )
print (df)
    Flight Day Flight Number  STD Departure  Sandwich 1  Sandwich 2  sum
1   2018-10-02        CA1332             15           4           4    8
3   2018-10-04        CA1473             15           4           4    8
4   2018-10-05        CA1336             10           3           2    5
5   2018-10-06        CA1331             15           2           2    4
6   2018-10-07        CA1666             15           3           3    6
7   2018-10-08        CA1336             15           1           4    5
8   2018-10-09        CA1336             10           5           2    7
9   2018-10-10        CA1336             10           5           5   10
11  2018-10-12        CA1667             11           1           1    2

Upvotes: 2

Related Questions