Reputation: 788
This actually is a follow-up question to my answer here. I have the original raw data set shown below and am converting it to the embedded form shown as Converted Data.
Two grouping blocks below do the job, but each seems to suffer from a different bottleneck, and I'm trying to speed them up further if possible. The unique TEST ID count in df_fails is ~20% of the total rows, and in df_tests_and_fails the same ratio for Reg No is ~30%.
For the df_fails.groupby block, even if I replace the loop body with a pass, the time spent stays pretty much the same. The groupby call itself seems to take 2/3 of the total running time of 10 s, and the ['Failure ID'] selection the remaining 1/3.
groups = df_fails.groupby(by=['TEST ID'])['Failure ID']
group_list = list()
for name, group in groups:
    group_list.append({'TEST ID': name, 'Failure IDs': group.to_list()})
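For what it's worth, the per-group appends can usually be replaced by a single `agg(list)` plus `to_dict('records')` pass. A minimal sketch with stand-in data (column names follow the block above; the values are made up):

```python
import pandas as pd

# Stand-in data shaped like df_fails (values are assumptions).
df_fails = pd.DataFrame({
    'TEST ID': ['AB1234', 'AB1234', 'WY5269'],
    'Failure ID': ['F01', 'F04', 'F04'],
})

# Loop-free version of the block above: one agg call builds the lists,
# and a single to_dict('records') replaces the per-group appends.
fail_records = (
    df_fails.groupby('TEST ID')['Failure ID']
            .agg(list)
            .rename('Failure IDs')
            .reset_index()
            .to_dict('records')
)
print(fail_records)
# [{'TEST ID': 'AB1234', 'Failure IDs': ['F01', 'F04']},
#  {'TEST ID': 'WY5269', 'Failure IDs': ['F04']}]
```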
This second block is more expensive than the one above: the total running time is around 60 s on my system, and the loop body takes ~95% of that.
groups = df_tests_and_fails.groupby(by=['Reg No'])
lst = list()
for name, group in groups:
    lst.append({'Reg No': name, 'MOTs': list(group.iloc[:, 1:].T.to_dict().values())})
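The transpose in the loop body is the usual suspect here; `to_dict('records')` builds the same per-row dicts without it. A sketch with stand-in data (column names follow the block above; the values are made up):

```python
import pandas as pd

# Stand-in data shaped like df_tests_and_fails (values are assumptions).
df_tests_and_fails = pd.DataFrame({
    'Reg No': ['AB12 XYZ', 'AB12 XYZ', 'CD34 KLM'],
    'Test ID': ['AB1234', 'AB5678', 'WY5269'],
})

# Move 'Reg No' into the index so each group's remaining columns become
# the per-row dicts, then build the records without transposing.
mots = (
    df_tests_and_fails.set_index('Reg No')
                      .groupby('Reg No')
                      .apply(lambda g: g.to_dict('records'))
)
records = [{'Reg No': name, 'MOTs': rows} for name, rows in mots.items()]
print(records)
# [{'Reg No': 'AB12 XYZ', 'MOTs': [{'Test ID': 'AB1234'}, {'Test ID': 'AB5678'}]},
#  {'Reg No': 'CD34 KLM', 'MOTs': [{'Test ID': 'WY5269'}]}]
```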
Original Raw Data
Vehicles
[
{
'Reg No': 'AB12 XYZ',
'Make Model': 'Renault Clio',
},
{
'Reg No': 'CD34 KLM',
'Make Model': 'Ford Focus',
}
]
MOTs
[
{
'Reg No': 'AB12 XYZ',
'Test ID': 'AB1234',
'Test Date': '01/01/2020',
'Total Cost': '127.65'
},
{
'Reg No': 'AB12 XYZ',
'Test ID': 'AB5678',
'Test Date': '01/01/2019',
'Total Cost': '175.30'
},
{
'Reg No': 'CD34 KLM',
'Test ID': 'WY5269',
'Test Date': '01/02/2020',
'Total Cost': '250'
},
{
'Reg No': 'CD34 KLM',
'Test ID': 'PW6638',
'Test Date': '03/02/2019',
'Total Cost': '65.20'
}
]
Failures
[
{
'Test ID': 'AB1234',
'Failure ID': 'F01'
},
{
'Test ID': 'AB1234',
'Failure ID': 'F04'
},
{
'Test ID': 'WY5269',
'Failure ID': 'F04'
},
{
'Test ID': 'PW6638',
'Failure ID': 'F47'
},
{
'Test ID': 'PW6638',
'Failure ID': 'F16'
}
]
Converted Data
{
'Reg No': 'AB12 XYZ',
'Make Model': 'Renault Clio',
'MOTs':
[
{
'Test ID': 'AB1234',
'Test Date': '01/01/2020',
'Total Cost': '127.65',
'Failure IDs': ['F01', 'F04']
}
]
}
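For completeness, the listings above can be loaded into DataFrames like so (a sketch; the question doesn't show its own construction, so `from_records` here is an assumption):

```python
import pandas as pd

# Loading the raw record lists above into DataFrames.
vehicles = [
    {'Reg No': 'AB12 XYZ', 'Make Model': 'Renault Clio'},
    {'Reg No': 'CD34 KLM', 'Make Model': 'Ford Focus'},
]
df_veh = pd.DataFrame.from_records(vehicles)
print(df_veh.shape)  # (2, 2)
print(list(df_veh.columns))  # ['Reg No', 'Make Model']
```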
As I am pretty new to Python and Pandas, I don't know if I'm missing anything obvious or whether there are quicker ways of doing things. This runs on a single-core VPS and I can't modify the environment, so multiprocessing or other libraries aren't options for me.
Upvotes: 0
Views: 55
Reputation: 13820
This solution uses no explicit loops or reshaping at all, just groupby with apply.
df_veh = pd.DataFrame.from_records(...)
df_mot = pd.DataFrame.from_records(...)
df_fail = pd.DataFrame.from_records(...)
df_fail_groups = df_fail.groupby('Test ID')['Failure ID'].apply(lambda s: s.tolist())
df_fail_groups
Test ID
AB1234 [F01, F04]
PW6638 [F47, F16]
WY5269 [F04]
Name: Failure ID, dtype: object
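A side note on the list-building step: `.apply(lambda s: s.tolist())` and `.agg(list)` produce the same Test ID → list-of-failures mapping, and `agg` is arguably the more idiomatic spelling (a minor observation, not a claimed speedup):

```python
import pandas as pd

# Same shape as the Failures listing above; values copied from it.
df_fail = pd.DataFrame({
    'Test ID': ['AB1234', 'AB1234', 'WY5269', 'PW6638', 'PW6638'],
    'Failure ID': ['F01', 'F04', 'F04', 'F47', 'F16'],
})

via_apply = df_fail.groupby('Test ID')['Failure ID'].apply(lambda s: s.tolist())
via_agg = df_fail.groupby('Test ID')['Failure ID'].agg(list)

# Both spellings yield the same mapping.
print(via_agg.to_dict())
# {'AB1234': ['F01', 'F04'], 'PW6638': ['F47', 'F16'], 'WY5269': ['F04']}
```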
df_mot_fail = df_mot.merge(df_fail_groups, how='left', on='Test ID')
df_mot_fail
Reg No Test ID Test Date Total Cost Failure ID
0 AB12 XYZ AB1234 01/01/2020 127.65 [F01, F04]
1 AB12 XYZ AB5678 01/01/2019 175.30 NaN
2 CD34 KLM WY5269 01/02/2020 250 [F04]
3 CD34 KLM PW6638 03/02/2019 65.20 [F47, F16]
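Note the NaN for AB5678, which had no matching failures in the left merge. If those entries should carry an empty list instead (an assumption about the desired output), the column can be cleaned up before the final grouping; a sketch:

```python
import pandas as pd

# Stand-in for df_mot_fail after the left merge: one row with a real
# failure list, one with NaN from the unmatched left join.
df_mot_fail = pd.DataFrame({
    'Test ID': ['AB1234', 'AB5678'],
    'Failure ID': [['F01', 'F04'], float('nan')],
})

# Replace the NaN cells with empty lists so downstream code can
# iterate over 'Failure ID' without a missing-value check.
df_mot_fail['Failure ID'] = df_mot_fail['Failure ID'].apply(
    lambda v: v if isinstance(v, list) else []
)
print(df_mot_fail['Failure ID'].tolist())  # [['F01', 'F04'], []]
```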
df_mot_fail_grouped = (
    df_mot_fail.set_index('Reg No')
               .groupby('Reg No')
               .apply(lambda s: s.to_dict('records'))
               .rename('MOTs')
)
df_mot_fail_grouped
Reg No
AB12 XYZ [{'Test ID': 'AB1234', 'Test Date': '01/01/202...
CD34 KLM [{'Test ID': 'WY5269', 'Test Date': '01/02/202...
Name: MOTs, dtype: object
records = df_veh.merge(df_mot_fail_grouped, how='left', on='Reg No').to_dict('records')
records
[{'Reg No': 'AB12 XYZ',
'Make Model': 'Renault Clio',
'MOTs': [{'Test ID': 'AB1234',
'Test Date': '01/01/2020',
'Total Cost': '127.65',
'Failure ID': ['F01', 'F04']},
{'Test ID': 'AB5678',
'Test Date': '01/01/2019',
'Total Cost': '175.30',
'Failure ID': nan}]},
{'Reg No': 'CD34 KLM',
'Make Model': 'Ford Focus',
'MOTs': [{'Test ID': 'WY5269',
'Test Date': '01/02/2020',
'Total Cost': '250',
'Failure ID': ['F04']},
{'Test ID': 'PW6638',
'Test Date': '03/02/2019',
'Total Cost': '65.20',
'Failure ID': ['F47', 'F16']}]}]
Upvotes: 1