Reputation: 335
I have this dataframe:
import pandas as pd
df = pd.DataFrame({'Flight Day': ['2018-10-01', '2018-10-02','2018-10-03', '2018-10-04', '2018-10-05','2018-10-06', '2018-10-07', '2018-10-08', '2018-10-09','2018-10-10','2018-10-11','2018-10-12'],
'Flight Number': ['CA1336', 'CA1332', 'CA1336', 'CA1473', 'CA1336', 'CA1331', 'CA1666', 'CA1336', 'CA1336', 'CA1336', 'CA1336', 'CA1667'],
'STD Departure': [10, 15, 10, 15,10, 15, 15, 15,10, 10, 10, 11],
'Sandwich 1': [2, 4, 8, 4,3, 2, 3, 1,5, 5, 2, 1],
'Sandwich 2': [2, 4, 8, 4,2, 2, 3, 4,2, 5, 2, 1]})
I first would like to keep only the 5 most recent days for each flight with the specific departure time. I used this formula so far:
df = df.groupby(['Flight Number','STD Departure']).tail(5)
Then I would like to delete the flights with the highest consumption (Sandwich + plus Sandwic 2) and the lowest consumption, again grouped first by Flight Number and STD Departure.
I tried this code but it did not bring the desired results:
FF = ["Sandwich 1", "Sandwich 2"]
df ["sum"] = df[FF].sum(axis=1)
df = df.groupby(['Flight Number','STD Departure', 'sum']).head(4)
df = df.groupby(['Flight Number','STD Departure', 'sum']).tail(3)
Any idea how I could get to my desired outcome:
Flight Day Flight Number STD Departure Sandwich 1 Sandwich 2 sum
1 2018-10-02 CA1332 15 4 4 8
3 2018-10-04 CA1473 15 4 4 8
4 2018-10-05 CA1336 10 3 2 5
5 2018-10-06 CA1331 15 2 2 4
6 2018-10-07 CA1666 15 3 3 6
7 2018-10-08 CA1336 15 1 4 5
8 2018-10-09 CA1336 10 5 2 7
9 2018-10-10 CA1336 10 5 5 10
11 2018-10-12 CA1667 11 1 1 2
These lines were delted in the last step:
10 2018-10-11 CA1336 10 2 2 4
2 2018-10-03 CA1336 10 8 8 16
Upvotes: 0
Views: 606
Reputation: 335
I figured it out on my own, I had forgotten to sort first.
df= df.sort_values (by = ["Flight Day", "Flight Number",'STD Departure'])
df = df.groupby(['Flight Number','STD Departure']).tail(5)
FF = ["Sandwich 1", "Sandwich 2"]
df ["sum"] = df[FF].sum(axis=1)
df= df.sort_values (by = ["Flight Number", "STD Departure", "sum"])
df = df.groupby(['Flight Number','STD Departure']).tail(4)
df = df.groupby(['Flight Number','STD Departure']).head(3)
Upvotes: 0
Reputation: 862681
I believe you need remove top and bottow row only if length of groups is more like 2 rows - first sorting by 3 columns and then remove per groups with iloc
and if-else
statement:
FF = ["Sandwich 1", "Sandwich 2"]
df["sum"] = df[FF].sum(axis=1)
df = df.groupby(['Flight Number','STD Departure']).head(5)
df = (df.sort_values(['Flight Number','sum','Flight Day'], ascending=[False, False, True])
.groupby(['Flight Number','STD Departure'], group_keys=False)
.apply(lambda x: x.iloc[1:-1] if len(x) > 2 else x)
.sort_index()
)
print (df)
Flight Day Flight Number STD Departure Sandwich 1 Sandwich 2 sum
1 2018-10-02 CA1332 15 4 4 8
3 2018-10-04 CA1473 15 4 4 8
4 2018-10-05 CA1336 10 3 2 5
5 2018-10-06 CA1331 15 2 2 4
6 2018-10-07 CA1666 15 3 3 6
7 2018-10-08 CA1336 15 1 4 5
8 2018-10-09 CA1336 10 5 2 7
9 2018-10-10 CA1336 10 5 5 10
11 2018-10-12 CA1667 11 1 1 2
Upvotes: 2