natnay

Reputation: 490

Loop through grouped data - Python/Pandas

I'm trying to perform an action on grouped data in Pandas. For each group, I want to loop through the rows and compare them to the first row in the group. If conditions are met, then I want to print out the row details. My data looks like this:

Orig  Dest  Route  Vol    Per   VolPct
ORD   ICN   A      2,251  0.64  0.78
ORD   ICN   B      366    0.97  0.13
ORD   ICN   C      142    0.14  0.05
DCA   FRA   A      9,059  0.71  0.85
DCA   FRA   B      1,348  0.92  0.13
DCA   FRA   C      281    0.8   0.03

My groups are Orig, Dest pairs. If a row in the group other than the first row has a Per greater than the first row and a VolPct greater than .1, I want to output the grouped pair and the route. In this example, the output would be:

ORD ICN B
DCA FRA B

My attempted code is as follows:

for lane in otp.groupby(otp['Orig','Dest']):
    X = lane.first(['Per'])
    for row in lane:
        if (row['Per'] > X and row['VolPct'] > .1):
            print(row['Orig','Dest','Route'])

However, this isn't working so I'm obviously not doing something right. I'm also not sure how to tell Python to ignore the first row when in the "row in lane" loop. Any ideas? Thanks!

Upvotes: 0

Views: 2880

Answers (1)

Grr

Reputation: 16079

You are pretty close as it is.

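First, you are calling groupby incorrectly. You should pass a list of the column names rather than indexing the DataFrame: otp['Orig','Dest'] looks up a single column keyed by the tuple ('Orig', 'Dest'), which does not exist here and raises a KeyError. So, instead of otp.groupby(otp['Orig','Dest']) you should use otp.groupby(['Orig','Dest']).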

Once you are looping through the groups you will hit more issues. Each item you get when iterating over a groupby object is actually a tuple. The first item in that tuple is the grouping key and the second is the grouped data. Groups are sorted by key by default, so your first group would be the following tuple:

(('DCA', 'FRA'),   Orig Dest Route    Vol   Per  VolPct
 3  DCA  FRA     A  9,059  0.71    0.85
 4  DCA  FRA     B  1,348  0.92    0.13
 5  DCA  FRA     C    281  0.80    0.03)
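To see that structure for yourself, here is a minimal sketch that rebuilds the sample data from the question and unpacks the first group. The frame is typed in by hand, the Vol column is left out because the comparison never uses it, and the names groups, first_key and first_frame are only illustrative.

```python
import pandas as pd

# Sample data copied from the question (Vol omitted, it is not needed here).
otp = pd.DataFrame({
    "Orig": ["ORD", "ORD", "ORD", "DCA", "DCA", "DCA"],
    "Dest": ["ICN", "ICN", "ICN", "FRA", "FRA", "FRA"],
    "Route": ["A", "B", "C", "A", "B", "C"],
    "Per": [0.64, 0.97, 0.14, 0.71, 0.92, 0.80],
    "VolPct": [0.78, 0.13, 0.05, 0.85, 0.13, 0.03],
})

# Iterating over the groupby yields (key, sub-DataFrame) pairs.
groups = list(otp.groupby(["Orig", "Dest"]))
first_key, first_frame = groups[0]

print(first_key)                   # grouping key of the first group
print(type(first_frame))           # the grouped rows are a DataFrame
print(first_frame.iloc[0]["Per"])  # Per value of that group's first row
```

Unpacking the pair directly in the for statement (for key, lane in ...) saves you from indexing the tuple with [0] and [1] inside the loop.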

You will need to change the way you set X to reflect this. For example, if lane is the whole tuple, X = lane.first(['Per']) should become X = lane[1].iloc[0].Per. After that you only have minor errors in the way you iterate through the rows and access multiple columns in a row. To wrap it all up, your loop should look something like this (here the tuple is unpacked into key and lane, so lane is already the grouped DataFrame):

for key, lane in otp.groupby(['Orig','Dest']):
    X = lane.iloc[0].Per
    for idx, row in lane.iterrows():
        if (row['Per'] > X and row['VolPct'] > .1):
            print(row[['Orig','Dest','Route']])

Note that I use iterrows to iterate through the rows, and I use double brackets when accessing multiple columns in a DataFrame.

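You don't really need to tell pandas to ignore the first row in each group: its Per can never be greater than itself, so it will never trigger your if statement. If you did want to skip it explicitly, you could use lane.iloc[1:].iterrows() (lane[1:].iterrows() also works, since slicing a DataFrame with integers selects rows).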
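For completeness, here is a self-contained sketch of the whole thing run against the sample data from the question. It collects the matches in a list instead of printing a Series per row, and it also shows a loop-free alternative based on transform("first"). Two things here are my own additions rather than part of the loop above: sort=False, which keeps the groups in their original order so the output matches the order in the question, and the names loop_hits and vector_hits, which are just illustrative. Also note that transform("first") returns the first non-null value per group, which equals the first row only when Per has no missing values.

```python
import pandas as pd

# Sample data copied from the question (Vol omitted, it is not needed here).
otp = pd.DataFrame({
    "Orig": ["ORD", "ORD", "ORD", "DCA", "DCA", "DCA"],
    "Dest": ["ICN", "ICN", "ICN", "FRA", "FRA", "FRA"],
    "Route": ["A", "B", "C", "A", "B", "C"],
    "Per": [0.64, 0.97, 0.14, 0.71, 0.92, 0.80],
    "VolPct": [0.78, 0.13, 0.05, 0.85, 0.13, 0.03],
})

# Loop version: compare every row after the first with the group's first row.
loop_hits = []
for (orig, dest), lane in otp.groupby(["Orig", "Dest"], sort=False):
    first_per = lane.iloc[0]["Per"]
    for _, row in lane.iloc[1:].iterrows():  # iloc[1:] skips the first row
        if row["Per"] > first_per and row["VolPct"] > 0.1:
            loop_hits.append((orig, dest, row["Route"]))

for orig, dest, route in loop_hits:
    print(orig, dest, route)

# Loop-free version: broadcast each group's first Per onto all of its rows,
# then filter with a boolean mask. Assumes Per has no missing values.
group_first_per = otp.groupby(["Orig", "Dest"])["Per"].transform("first")
mask = (otp["Per"] > group_first_per) & (otp["VolPct"] > 0.1)
vector_hits = otp.loc[mask, ["Orig", "Dest", "Route"]]
print(vector_hits)
```

On a frame this small either version is fine. The masked version avoids iterrows, which tends to be slow on large frames.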

Upvotes: 2
