Reputation: 57
I have two datasets:
I want to get the number of messages per locomotive that had a delivery time more than or equal to 300 seconds.
In order to calculate this, I wrote the following code (dataframe 1 is 'df_loco', dataframe 2 is 'df_sent'):
for i in range(0, len(df_loco['LOCO']),1):
for j in range(0,len(df_sent['HM_ID_HM']),1):
if (df_sent.iloc[j,1] == df_loco.iloc[i,0]) and (df_sent.iloc[j,11]>=a):
df_loco.iloc[i,4] += 1
The code does the job and gives me the correct number of messages that suffered delays. However, my dataset is big (dataframe 1 is 300+ rows, dataframe 2 is 55.000+) so it takes a long time to execute it.
I have been able to get rid of nested for loops in parts of my code which made it 100x more efficient. But here I'm having issues implementing the 'and' operator while not using 'for' loops. It returns me an error stating that the code is too ambiguous(?).
Question: Is there a way to improve this code, not having to work with nested loops?
Upvotes: 0
Views: 93
Reputation: 9857
Using pandas you could merge the 2 dataframes with a left join, filter for DELIVERY_TIME > 300 and then use groupby to get a count for LOCO_ID.
import pandas as pd
df_loco = pd.DataFrame({'LOCO_ID':[123, 456, 789, 321]})
df_sent = pd.DataFrame({'MESSAGE_ID':range(1, 7),'LOCO_ID':[456, 123, 123, 321, 789, 123], 'DELIVERY_TIME':[14,800, 420,310,60, 14]})
merged_dataframe = df_loco.merge(df_sent, how='left')
count_dataframe = merged_dataframe[merged_dataframe['DELIVERY_TIME']>300].groupby('LOCO_ID')['LOCO_ID'].count()
print(count_dataframe)
Upvotes: 1