David

Reputation: 79

Fastest way to iterate subsets of rows in pandas dataframe based on condition

I have a dataframe with an ID column and a Value column; each ID is associated with two or more rows. Whenever a subset of values within a given ID meets my condition (here, the subset nets to zero), I want to label those rows with a new ID: the original ID plus an appended number that uniquely identifies the subset.

Code I've tried so far:

import pandas as pd

d = {'ID': ['2016/01 100', '2016/01 100', '2016/01 100', '2016/01 100',
            '2016/01 200', '2016/01 200', '2016/01 200', '2016/01 200'],
     'Value': [-343.68, 343.68, -55.2, 55.2, -158.77, 158.77, 123.3, -123.3]}

df = pd.DataFrame(data=d)

df['Cumulative_Sum'] = df.Value.cumsum().round(2)  # running total across all rows

print(df)

sum_counter = 1  # subset number within the current ID
counter = 0      # first row of the current (not yet labeled) subset
for row in df.index.values:
    if df.loc[row, 'Cumulative_Sum'] == 0:
        # the running sum closed out: label every row in this subset
        df.loc[counter:row, 'New_ID'] = str(df.loc[row, 'ID']) + "_" + str(sum_counter)
        counter = row + 1
        sum_counter += 1
        if counter < len(df.index.values):
            if df.loc[counter, 'ID'] != df.loc[row, 'ID']:
                sum_counter = 1  # a new ID starts: restart subset numbering

print (df)

This produces the desired result, but it is slow when run on hundreds of thousands of rows.

            ID   Value  Cumulative_Sum         New_ID
0  2016/01 100 -343.68         -343.68  2016/01 100_1
1  2016/01 100  343.68            0.00  2016/01 100_1
2  2016/01 100  -55.20          -55.20  2016/01 100_2
3  2016/01 100   55.20            0.00  2016/01 100_2
4  2016/01 200 -158.77         -158.77  2016/01 200_1
5  2016/01 200  158.77            0.00  2016/01 200_1
6  2016/01 200  123.30          123.30  2016/01 200_2
7  2016/01 200 -123.30            0.00  2016/01 200_2

Is there a faster way to do this without looping while still keeping the New IDs in the format I need?

Upvotes: 1

Views: 1143

Answers (1)

piRSquared

Reputation: 294258

I use nested groupby calls: within each ID, a reversed cumulative sum over the zero flags (`Cumulative_Sum.eq(0).iloc[::-1].cumsum()`) identifies the contiguous groups that end in zero, and `ngroup` turns each group into a numeric label.

Then I use `pd.Series.str.cat` to attach the result to the old ID.

df.assign(
    New_ID=df.ID.str.cat(
        # group_keys=False keeps the original index so str.cat aligns correctly
        df.groupby('ID', group_keys=False).apply(
            lambda d: d.groupby(
                d.Cumulative_Sum.eq(0).iloc[::-1].cumsum(),
                sort=False
            ).ngroup()
        ).add(1).astype(str),
        sep='_'
    )
)

            ID   Value  Cumulative_Sum         New_ID
0  2016/01 100 -343.68         -343.68  2016/01 100_1
1  2016/01 100  343.68            0.00  2016/01 100_1
2  2016/01 100  -55.20          -55.20  2016/01 100_2
3  2016/01 100   55.20            0.00  2016/01 100_2
4  2016/01 200 -158.77         -158.77  2016/01 200_1
5  2016/01 200  158.77            0.00  2016/01 200_1
6  2016/01 200  123.30          123.30  2016/01 200_2
7  2016/01 200 -123.30            0.00  2016/01 200_2

Upvotes: 2
