Khozzy

Reputation: 1103

Optimizing `apply` in Python Pandas

I have created a function that parses JSON objects, extracts the useful fields and builds a Pandas data frame.

def parse_metrics_to_df(metrics):
    def extract_details(row):
        row['trial'] = row['agent']['trial']
        row['numerosity'] = row['agent']['numerosity']
        row['reliable'] = row['agent']['reliable']
        row['was_correct'] = row['performance']['was_correct']
        return row

    df = pd.DataFrame(metrics)
    df = df.apply(extract_details, axis=1)
    df.drop(['agent', 'environment', 'performance'], axis=1, inplace=True)
    df.set_index('trial', inplace=True)

    return df

The `metrics` argument is a list of JSON documents that look similar to the following (first two rows shown):

[{'agent': {'fitness': 25.2375,
   'numerosity': 1,
   'population': 1,
   'reliable': 0,
   'steps': 1,
   'total_steps': 1,
   'trial': 0},
  'environment': None,
  'performance': {'was_correct': True}},
 {'agent': {'fitness': 23.975625,
   'numerosity': 1,
   'population': 1,
   'reliable': 0,
   'steps': 1,
   'total_steps': 2,
   'trial': 1},
  'environment': None,
  'performance': {'was_correct': False}}]

Then executed as follows:

df = parse_metrics_to_df(metrics)


The code works as expected, but it's extremely slow: parsing a list with a million objects takes nearly an hour.

How to do this properly?

Upvotes: 1

Views: 1618

Answers (2)

jpp

Reputation: 164613

You should see a significant (~9x for me) improvement by using a simple list comprehension.

`pd.DataFrame` often carries overhead that can be avoided by performing the manipulations before the data is put into the dataframe.

def parse_metrics_to_df(metrics):
    def extract_details(row):
        row['trial'] = row['agent']['trial']
        row['numerosity'] = row['agent']['numerosity']
        row['reliable'] = row['agent']['reliable']
        row['was_correct'] = row['performance']['was_correct']
        return row

    df = pd.DataFrame(metrics)
    df = df.apply(extract_details, axis=1)
    df.drop(['agent', 'environment', 'performance'], axis=1, inplace=True)
    df.set_index('trial', inplace=True)

    return df


def jp(metrics):

    lst = [[d['agent']['trial'], d['agent']['numerosity'], d['agent']['reliable'],
            d['performance']['was_correct']] for d in metrics]

    df = pd.DataFrame(lst, columns=['trial', 'numerosity', 'reliable', 'was_correct'])
    df = df.set_index('trial')

    return df

%timeit parse_metrics_to_df(metrics)   # 14.4 ms
%timeit jp(metrics)                    # 1.6 ms
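
Another option worth benchmarking is `pd.json_normalize` (top-level in pandas since 1.0), which flattens the nested dicts in one call and avoids writing out the extraction by hand. A minimal sketch, assuming a trimmed two-record sample of the data shown in the question:

```python
import pandas as pd

# Trimmed two-record sample of the metrics structure from the question
metrics = [
    {'agent': {'trial': 0, 'numerosity': 1, 'reliable': 0},
     'environment': None,
     'performance': {'was_correct': True}},
    {'agent': {'trial': 1, 'numerosity': 1, 'reliable': 0},
     'environment': None,
     'performance': {'was_correct': False}},
]

# Nested keys become dotted column names, e.g. 'agent.trial'
df = pd.json_normalize(metrics)
df = df[['agent.trial', 'agent.numerosity', 'agent.reliable',
         'performance.was_correct']]
df.columns = ['trial', 'numerosity', 'reliable', 'was_correct']
df = df.set_index('trial')
```

Whether this beats the list comprehension depends on the pandas version and data shape, so it's worth timing on the real million-row input.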

Upvotes: 1

Minje Jeon

Reputation: 131

Manipulating the `Series` object is the bottleneck. Creating a new `Series` from a dict is much faster.

Setup

import pandas as pd

metrics = [{'agent': {'fitness': 25.2375,
   'numerosity': 1,
   'population': 1,
   'reliable': 0,
   'steps': 1,
   'total_steps': 1,
   'trial': 0},
  'environment': None,
  'performance': {'was_correct': True}},
 {'agent': {'fitness': 23.975625,
   'numerosity': 1,
   'population': 1,
   'reliable': 0,
   'steps': 1,
   'total_steps': 2,
   'trial': 1},
  'environment': None,
  'performance': {'was_correct': False}}]
thousand_metrics = metrics * 1000 

Original code

def parse_metrics_to_df(metrics):
    def extract_details(row):
        row['trial'] = row['agent']['trial']
        row['numerosity'] = row['agent']['numerosity']
        row['reliable'] = row['agent']['reliable']
        row['was_correct'] = row['performance']['was_correct']
        return row

    df = pd.DataFrame(metrics)
    df = df.apply(extract_details, axis=1)
    df.drop(['agent', 'environment', 'performance'], axis=1, inplace=True)
    df.set_index('trial', inplace=True)

    return df

%timeit df = parse_metrics_to_df(thousand_metrics)

# 4.06 s ± 87.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Modified code

def parse_metrics_to_df2(metrics):
    def extract_details(row):
        res = {}
        res['trial'] = row['agent']['trial']
        res['numerosity'] = row['agent']['numerosity']
        res['reliable'] = row['agent']['reliable']
        res['was_correct'] = row['performance']['was_correct']
        return pd.Series(res)

    df = pd.DataFrame(metrics)
    df = df.apply(extract_details, axis=1)
    df.set_index('trial', inplace=True)

    return df

df = parse_metrics_to_df(thousand_metrics)
df2 = parse_metrics_to_df2(thousand_metrics)
df.equals(df2) # True

%timeit df2 = parse_metrics_to_df2(thousand_metrics)

# 566 ms ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Now, 7x faster.
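
Taking the same idea one step further, `apply` can be skipped entirely by building a list of flat dicts first and handing it straight to the `DataFrame` constructor. A sketch under the same assumptions (trimmed two-record sample; `parse_metrics_to_df3` is a hypothetical name):

```python
import pandas as pd

# Trimmed two-record sample of the metrics structure from the question
metrics = [
    {'agent': {'trial': 0, 'numerosity': 1, 'reliable': 0},
     'environment': None,
     'performance': {'was_correct': True}},
    {'agent': {'trial': 1, 'numerosity': 1, 'reliable': 0},
     'environment': None,
     'performance': {'was_correct': False}},
]

def parse_metrics_to_df3(metrics):
    # Flatten each record in plain Python; the constructor then
    # builds every column in one vectorized pass
    records = [{'trial': m['agent']['trial'],
                'numerosity': m['agent']['numerosity'],
                'reliable': m['agent']['reliable'],
                'was_correct': m['performance']['was_correct']}
               for m in metrics]
    return pd.DataFrame(records).set_index('trial')
```

Since each row is touched only once in pure Python and `apply` never runs, this should scale roughly linearly with the input size; timing it against `parse_metrics_to_df2` on the real data would confirm.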

Upvotes: 1
