dev33582
dev33582

Reputation: 23

Optimizing python dataframe iteration loop

I have two dataframes, df and ct

df: UNIT, START, END, CHECK are relevant columns. CHECK values are all defaulted to 'n'. My code tests them and if they pass changes this value to 'y'

df.shape = 59155, 5
UNIT DTHours START END CHECK
0 A7000 0.1097 43100.0321 43382.1417 n
1 A7000 0.0125 43105.6621 43382.6736 n
2 A7000 0.2042 43200.1214 43388.3465 n
3 A7000 0.1521 43389.2979 43854.4500 n
4 A7002 0.1455 44198.5049 44199.0009 n

ct: UNIT, START, END are unique combinations, but each UNIT can have multiple entries (these are effectively start/end dates). On average each UNIT row repeats ~5 times

ct.shape = 6219, 3
UNIT START END
0 A7000 43101 43156
1 A7000 43101 43173
2 A7000 43172 43616
3 A7000 43172 43712
4 A7002 43172 44196

I am testing if the values df['START'] and df['END '] are in between any of the instances of that UNIT in ct. However the code is taking exceptionally long. It has iterated through ~12000 rows in 12 hours. What am I doing wrong?

Pseudocode:

 for loop iterating through all rows of df
      create list of all cf[START] values for current df[UNIT]
      create list of all cf[END] values for current df[UNIT]

      test if the df[CHECK] value is 'n' viz default value
           for loop iterating over length of cf[START] (len = cf[END]), index
                test if (df[START]>= cf[START][index]) AND (df[END]>= cf[END][index])
                     change df[CHECK] to 'y'
                     break for loop iterating over cf[START] for this row in df

Actual code:

for index, row in df.iterrows():
    display.clear_output(wait=True)
    display.display(plt.gcf())
    
    df_unit = row['UNIT']
    ct_start = ct.loc[ct['UNIT'] == df_unit]['START'].values
    ct_end = ct.loc[ct['UNIT'] == df_unit]['END'].values
    
    if (row['UNIT'] != 'y'):
        for ct_index in range(len(ct_start)):
            if ((row['START'] >= ct_start[ct_index]) & (row['END'] <= 
ct_end[ct_index])):
                row['CHECK'] == 'y'
                aaa += 1
                break
        
    plt.scatter(index,aaa, c='r')

Expected Output for the dataframes above

UNIT DTHours START END CHECK
0 A7000 0.1097 43100.0321 43382.1417 n
1 A7000 0.0125 43105.6621 43382.6736 n
2 A7000 0.2042 43200.1214 43388.3465 y
3 A7000 0.1521 43389.2979 43854.4500 n
4 A7002 0.1455 44198.5049 44199.0009 n

Upvotes: 1

Views: 57

Answers (2)

Shubham Sharma
Shubham Sharma

Reputation: 71689

We can approach the problem using the following steps

  • Reset the index of df and merge it with ct on UNIT
  • Query the resulting merged dataframe to find the indices where the corresponding values in START and END satisfies the condition of inclusion
  • Using boolean indexing with loc update the values in CHECK column to y using the indices obtained in previous step
s  = df.reset_index().merge(ct, on='UNIT', suffixes=['', '_r'])
ix = s.query("START >= START_r and END <= END_r")['index']
df.loc[df.index.isin(ix), 'CHECK'] = 'y'

    UNIT  DTHours       START         END CHECK
0  A7000   0.1097  43100.0321  43382.1417     n
1  A7000   0.0125  43105.6621  43382.6736     n
2  A7000   0.2042  43200.1214  43388.3465     y
3  A7000   0.1521  43389.2979  43854.4500     n
4  A7002   0.1455  44198.5049  44199.0009     n

Upvotes: 0

qmeeus
qmeeus

Reputation: 2402

As a rule of thumb: 90% of the time if you are using a loop on a dataframe, you are doing something wrong

In your case:

  • rather than doing for ... if, it might be a good idea to filter the dataframe: df[df["CHECK"] == "y"]
  • you are constantly filtering the ct dataframe, for EACH row. You are better off grouping this dataframe once and transforming that into a dictionary that you can access by unit:
    • units = dict(list(ct.groupby("UNIT")))
    • for index in units[df_unit]: do_something()

This should already considerably improve the performance

Upvotes: 1

Related Questions