Reputation: 69
I am fairly new to Pandas and desperately need help to modify my dataframe by comparing consecutive rows within multiple group pairings:
Dataframe example =
idData idStation idCast Pressure
1 1 1 1505
2 1 1 1506
3 1 1 1507
4 1 1 1508
5 1 2 1505
6 1 2 1506
7 1 2 1503
8 1 2 1504
9 2 1 1505
10 2 1 1506
11 2 1 1507
etc
I want to delete any rows where the Pressure value for that row is less than those above it, for each idStation and idCast pair (i.e. rows where idData = 7 and 8 need to be deleted). I don't want to compare different idStation & idCast pairs - i.e. first iteration of the loop would look through the Pressure record for idStation = 1, idCast = 1 and the second iteration of the loop would look through the record for idStation = 1, idCast = 2 etc). I have tried grouping by idStation and idCast, then looping over them and comparing row by row but this only modifies a copy, not the original dataframe and the changes are lost.
Stn_Cast_Group = Dataframe.groupby(['idStation','idCast'])
for name, group in Stn_Cast_Group:
j = 0
for i in range(1,len(group['Pressure'])):
if group['Pressure'].iloc[i] < j:
group['Pressure'].iloc[i] = np.nan
else:
j = group['Pressure'].iloc[i]
This labels the correct pressure values as nan (as I am unsure of how to delete the row) but only for the group view.
How would you create a copy of the dataframe (so as to have access to original and modified versions) and then delete the rows as mentioned above?
Upvotes: 1
Views: 4463
Reputation: 42875
You could use boolean indexing
with the condition that for each group the Pressure
value is not below the .cummax()
for that group.
.groupby()
to work on unique subsets of ['idStation', 'idCast']
. 'Pressure'
value is below the running max
(cummax()
), and return the inverse boolean
(ie False
if this the value is below). The result is returned from groupby()
as .values
.2.
in boolean indexing
to drop the rows that did meet the condition.In pandas
code, this looks like:
df[~df.groupby(['idStation', 'idCast']).apply(lambda x: x.Pressure < x.Pressure.cummax()).values]
to get:
idData idStation idCast Pressure
0 1 1 1 1505
1 2 1 1 1506
2 3 1 1 1507
3 4 1 1 1508
4 5 1 2 1505
5 6 1 2 1506
8 9 2 1 1505
9 10 2 1 1506
10 11 2 1 1507
You can also eliminate the .apply()
step as suggested by @MaxU for a ~40% speed increase for the toy data:
df[df.Pressure >= df.groupby(['idStation','idCast'])['Pressure'].cummax()]
Upvotes: 3