G.Peach
G.Peach

Reputation: 69

iterating over multiple columns and rows in pandas dataframe

I am fairly new to Pandas and desperately need help to modify my dataframe by comparing consecutive rows within multiple group pairings:

Dataframe example =

idData idStation idCast Pressure

1   1   1   1505
2   1   1   1506
3   1   1   1507
4   1   1   1508
5   1   2   1505
6   1   2   1506
7   1   2   1503
8   1   2   1504
9   2   1   1505
10  2   1   1506
11  2   1   1507

etc

I want to delete any rows where the Pressure value for that row is less than those above it, for each idStation and idCast pair (i.e. rows where idData = 7 and 8 need to be deleted). I don't want to compare different idStation & idCast pairs - i.e. first iteration of the loop would look through the Pressure record for idStation = 1, idCast = 1 and the second iteration of the loop would look through the record for idStation = 1, idCast = 2 etc). I have tried grouping by idStation and idCast, then looping over them and comparing row by row but this only modifies a copy, not the original dataframe and the changes are lost.

Stn_Cast_Group = Dataframe.groupby(['idStation','idCast'])
for name, group in Stn_Cast_Group:
        j = 0
        for i in range(1,len(group['Pressure'])):                     
            if group['Pressure'].iloc[i] < j:
                group['Pressure'].iloc[i] = np.nan
            else:
                j = group['Pressure'].iloc[i]

This labels the correct pressure values as nan (as I am unsure of how to delete the row) but only for the group view.

How would you create a copy of the dataframe (so as to have access to original and modified versions) and then delete the rows as mentioned above?

Upvotes: 1

Views: 4463

Answers (1)

Stefan
Stefan

Reputation: 42875

You could use boolean indexing with the condition that for each group the Pressure value is not below the .cummax() for that group.

  1. Use .groupby() to work on unique subsets of ['idStation', 'idCast'].
  2. For each of these groups, check if the current 'Pressure' value is below the running max (cummax()), and return the inverse boolean (ie False if this the value is below). The result is returned from groupby() as .values.
  3. Use the result of 2. in boolean indexing to drop the rows that did meet the condition.

In pandas code, this looks like:

df[~df.groupby(['idStation', 'idCast']).apply(lambda x: x.Pressure < x.Pressure.cummax()).values]

to get:

    idData  idStation  idCast  Pressure
0        1          1       1      1505
1        2          1       1      1506
2        3          1       1      1507
3        4          1       1      1508
4        5          1       2      1505
5        6          1       2      1506
8        9          2       1      1505
9       10          2       1      1506
10      11          2       1      1507

You can also eliminate the .apply() step as suggested by @MaxU for a ~40% speed increase for the toy data:

df[df.Pressure >= df.groupby(['idStation','idCast'])['Pressure'].cummax()]

Upvotes: 3

Related Questions