Reputation: 23
I have two dataframes, df and ct
df: UNIT, START, END, CHECK
are relevant columns. CHECK values are all defaulted to 'n'. My code tests them and if they pass changes this value to 'y'
df.shape = 59155, 5
UNIT | DTHours | START | END | CHECK | |
---|---|---|---|---|---|
0 | A7000 | 0.1097 | 43100.0321 | 43382.1417 | n |
1 | A7000 | 0.0125 | 43105.6621 | 43382.6736 | n |
2 | A7000 | 0.2042 | 43200.1214 | 43388.3465 | n |
3 | A7000 | 0.1521 | 43389.2979 | 43854.4500 | n |
4 | A7002 | 0.1455 | 44198.5049 | 44199.0009 | n |
ct: UNIT, START, END
are unique combinations, but each UNIT can have multiple entries (these are effectively start/end dates). On average each UNIT row repeats ~5 times
ct.shape = 6219, 3
UNIT | START | END | |
---|---|---|---|
0 | A7000 | 43101 | 43156 |
1 | A7000 | 43101 | 43173 |
2 | A7000 | 43172 | 43616 |
3 | A7000 | 43172 | 43712 |
4 | A7002 | 43172 | 44196 |
I am testing if the values df['START'] and df['END '] are in between any of the instances of that UNIT in ct. However the code is taking exceptionally long. It has iterated through ~12000 rows in 12 hours. What am I doing wrong?
Pseudocode:
for loop iterating through all rows of df
create list of all cf[START] values for current df[UNIT]
create list of all cf[END] values for current df[UNIT]
test if the df[CHECK] value is 'n' viz default value
for loop iterating over length of cf[START] (len = cf[END]), index
test if (df[START]>= cf[START][index]) AND (df[END]>= cf[END][index])
change df[CHECK] to 'y'
break for loop iterating over cf[START] for this row in df
Actual code:
for index, row in df.iterrows():
display.clear_output(wait=True)
display.display(plt.gcf())
df_unit = row['UNIT']
ct_start = ct.loc[ct['UNIT'] == df_unit]['START'].values
ct_end = ct.loc[ct['UNIT'] == df_unit]['END'].values
if (row['UNIT'] != 'y'):
for ct_index in range(len(ct_start)):
if ((row['START'] >= ct_start[ct_index]) & (row['END'] <=
ct_end[ct_index])):
row['CHECK'] == 'y'
aaa += 1
break
plt.scatter(index,aaa, c='r')
Expected Output for the dataframes above
UNIT | DTHours | START | END | CHECK | |
---|---|---|---|---|---|
0 | A7000 | 0.1097 | 43100.0321 | 43382.1417 | n |
1 | A7000 | 0.0125 | 43105.6621 | 43382.6736 | n |
2 | A7000 | 0.2042 | 43200.1214 | 43388.3465 | y |
3 | A7000 | 0.1521 | 43389.2979 | 43854.4500 | n |
4 | A7002 | 0.1455 | 44198.5049 | 44199.0009 | n |
Upvotes: 1
Views: 57
Reputation: 71689
We can approach the problem using the following steps
df
and merge
it with ct
on UNIT
START
and END
satisfies the condition of inclusionloc
update the values in CHECK
column to y
using the indices obtained in previous steps = df.reset_index().merge(ct, on='UNIT', suffixes=['', '_r'])
ix = s.query("START >= START_r and END <= END_r")['index']
df.loc[df.index.isin(ix), 'CHECK'] = 'y'
UNIT DTHours START END CHECK
0 A7000 0.1097 43100.0321 43382.1417 n
1 A7000 0.0125 43105.6621 43382.6736 n
2 A7000 0.2042 43200.1214 43388.3465 y
3 A7000 0.1521 43389.2979 43854.4500 n
4 A7002 0.1455 44198.5049 44199.0009 n
Upvotes: 0
Reputation: 2402
As a rule of thumb: 90% of the time if you are using a loop on a dataframe, you are doing something wrong
In your case:
for ... if
, it might be a good idea to filter the dataframe: df[df["CHECK"] == "y"]
ct
dataframe, for EACH row. You are better off grouping this dataframe once and transforming that into a dictionary that you can access by unit:
units = dict(list(ct.groupby("UNIT")))
for index in units[df_unit]: do_something()
This should already considerably improve the performance
Upvotes: 1