abdus_salam
abdus_salam

Reputation: 788

How to reshape Pandas DataFrames Effectively

This actually is a follow up question to my answer on here. I have the original raw data set show below and converting it to the embedded form shown as Converted Data.

Two grouping blocks below do the job but each seem to be suffering from different bottlenecks and I'm trying to speed them up further, if possible. The unique TEST ID count in df_fails is ~20% of the total rows and in df_tests_and_fails this ratio for Reg No is ~30%.

For the df_fails.groupby block, even if I replace the loop body with a 'pass', the time spent remains pretty much the same. The actual groupby call seems to be taking 2/3 of the total running time of 10s and ['Failure ID'] uses 1/3.

groups = df_fails.groupby(by=['TEST ID'])['Failure ID']
group_list = list()
for name, group in groups:
    group_list.append({'TEST ID': name, 'Failure IDs': group.to_list()})

This second block is more expensive compared to the above and the total running time is around 60s on my system and the loop body is taking ~95% of that time.

groups = df_tests_and_fails.groupby(by=['Reg No'])
lst = list()
for name, group in groups:
    lst.append({'Reg No': name, 'MOTs': list(group.iloc[:, 1:].T.to_dict().values())})

Original Raw Data

Vehicles
[
    {
        'Reg No': 'AB12 XYZ',
        'Make Model': 'Renault Clio',
    },
    {
        'Reg No': 'CD34 KLM',
        'Make Model': 'Ford Focus',
    }
]

MOTs
[
        {
            'Reg No': 'AB12 XYZ',   
            'Test ID': 'AB1234',
            'Test Date': '01/01/2020',
            'Total Cost': '127.65'
        },
        {
            'Reg No': 'AB12 XYZ',   
            'Test ID': 'AB5678',
            'Test Date': '01/01/2019',
            'Total Cost': '175.30'
        },
        {
            'Reg No': 'CD34 KLM',   
            'Test ID': 'WY5269',
            'Test Date': '01/02/2020',
            'Total Cost': '250'
        },
        {
            'Reg No': 'CD34 KLM',   
            'Test ID': 'PW6638',
            'Test Date': '03/02/2019',
            'Total Cost': '65.20'
    }       
]

Failures
[
        {
            'Test ID':  'AB1234',
            'Failure ID': 'F01'
        },
        {
            'Test ID':  'AB1234',
            'Failure ID': 'F04'
        }   
        {
            'Test ID':  'WY5269',
            'Failure ID': 'F04'
        },
        {
            'Test ID':  'PW6638',
            'Failure ID': 'F47'
        },
        {
            'Test ID':  'PW6638',
            'Failure ID': 'F16'
        } 
]

Converted Data

{
    'Reg No': 'AB12 XYZ',
    'Make Model': 'Renault Clio',
    'MOTs':
    [
        {
            'Test ID': 'AB1234',
            'Test Date': '01/01/2020',
            'Total Cost': '127.65',
            'Failure IDs': ['F01', 'F04']
        }   
    ]
}

As I am pretty new to Python and Pandas, I don't know if I'm missing anything obvious or there are quicker ways of doing things? As this runs on a single core VPS and I can't modify the environment, using multiprocessing or other libraries aren't my options.

Upvotes: 0

Views: 55

Answers (1)

BallpointBen
BallpointBen

Reputation: 13820

This solution uses no reshaping at all, just groupby agg.

df_veh = pd.DataFrame.from_records(...)
df_mot = pd.DataFrame.from_records(...)
df_fail = pd.DataFrame.from_records(...)

df_fail_groups = df_fail.groupby('Test ID')['Failure ID'].apply(lambda s: s.tolist())
df_fail_groups
Test ID
AB1234    [F01, F04]
PW6638    [F47, F16]
WY5269         [F04]
Name: Failure ID, dtype: object

df_mot_fail = df_mot.merge(df_fail_groups, how='left', on='Test ID')
df_mot_fail
     Reg No Test ID   Test Date Total Cost  Failure ID
0  AB12 XYZ  AB1234  01/01/2020     127.65  [F01, F04]
1  AB12 XYZ  AB5678  01/01/2019     175.30         NaN
2  CD34 KLM  WY5269  01/02/2020        250       [F04]
3  CD34 KLM  PW6638  03/02/2019      65.20  [F47, F16]

df_mot_fail_grouped = df_mot_fail.set_index('Reg No').groupby('Reg No').apply(lambda s: s.to_dict('records')).rename('MOTs')
df_mot_fail_grouped
Reg No
AB12 XYZ    [{'Test ID': 'AB1234', 'Test Date': '01/01/202...
CD34 KLM    [{'Test ID': 'WY5269', 'Test Date': '01/02/202...
Name: MOTs, dtype: object

records = df_veh.merge(df_mot_fail_grouped, how='left', on='Reg No').to_dict('records')
records
[{'Reg No': 'AB12 XYZ',
  'Make Model': 'Renault Clio',
  'MOTs': [{'Test ID': 'AB1234',
    'Test Date': '01/01/2020',
    'Total Cost': '127.65',
    'Failure ID': ['F01', 'F04']},
   {'Test ID': 'AB5678',
    'Test Date': '01/01/2019',
    'Total Cost': '175.30',
    'Failure ID': nan}]},
 {'Reg No': 'CD34 KLM',
  'Make Model': 'Ford Focus',
  'MOTs': [{'Test ID': 'WY5269',
    'Test Date': '01/02/2020',
    'Total Cost': '250',
    'Failure ID': ['F04']},
   {'Test ID': 'PW6638',
    'Test Date': '03/02/2019',
    'Total Cost': '65.20',
    'Failure ID': ['F47', 'F16']}]}]

Upvotes: 1

Related Questions