Reputation: 1103
I have created a function responsible for parsing JSON objects, extracting the useful fields and creating a Pandas DataFrame.
import pandas as pd

def parse_metrics_to_df(metrics):
    def extract_details(row):
        row['trial'] = row['agent']['trial']
        row['numerosity'] = row['agent']['numerosity']
        row['reliable'] = row['agent']['reliable']
        row['was_correct'] = row['performance']['was_correct']
        return row

    df = pd.DataFrame(metrics)
    df = df.apply(extract_details, axis=1)
    df.drop(['agent', 'environment', 'performance'], axis=1, inplace=True)
    df.set_index('trial', inplace=True)
    return df
The metrics argument is an array of JSON documents looking similar to this (first two rows):
[{'agent': {'fitness': 25.2375,
            'numerosity': 1,
            'population': 1,
            'reliable': 0,
            'steps': 1,
            'total_steps': 1,
            'trial': 0},
  'environment': None,
  'performance': {'was_correct': True}},
 {'agent': {'fitness': 23.975625,
            'numerosity': 1,
            'population': 1,
            'reliable': 0,
            'steps': 1,
            'total_steps': 2,
            'trial': 1},
  'environment': None,
  'performance': {'was_correct': False}}]
The function is then executed as follows:
df = parse_metrics_to_df(metrics)
The code works as expected but it's extremely slow. Parsing an array with a million objects takes nearly an hour.
How can I do this properly?
Upvotes: 1
Views: 1618
Reputation: 164613
You should see a significant (~9x for me) improvement by using a simple list comprehension.
Row-wise operations such as df.apply
carry overhead which can be avoided by performing the manipulations in plain Python before the data is put in the DataFrame.
import pandas as pd

# Original approach, for comparison.
def parse_metrics_to_df(metrics):
    def extract_details(row):
        row['trial'] = row['agent']['trial']
        row['numerosity'] = row['agent']['numerosity']
        row['reliable'] = row['agent']['reliable']
        row['was_correct'] = row['performance']['was_correct']
        return row

    df = pd.DataFrame(metrics)
    df = df.apply(extract_details, axis=1)
    df.drop(['agent', 'environment', 'performance'], axis=1, inplace=True)
    df.set_index('trial', inplace=True)
    return df

# Extract the fields in a list comprehension, then build the DataFrame once.
def jp(metrics):
    lst = [[d['agent']['trial'], d['agent']['numerosity'], d['agent']['reliable'],
            d['performance']['was_correct']] for d in metrics]
    df = pd.DataFrame(lst, columns=['trial', 'numerosity', 'reliable', 'was_correct'])
    df = df.set_index('trial')
    return df
%timeit parse_metrics_to_df(metrics) # 14.4 ms
%timeit jp(metrics) # 1.6 ms
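If you prefer to keep the field names next to the values, the same idea works with a list of dicts. A minimal sketch (jp_records is a hypothetical name, not from the original answer):

import pandas as pd

def jp_records(metrics):
    # Extract the wanted fields in plain Python; pandas only ever sees
    # flat records, so no row-wise apply is needed.
    records = [{'trial': d['agent']['trial'],
                'numerosity': d['agent']['numerosity'],
                'reliable': d['agent']['reliable'],
                'was_correct': d['performance']['was_correct']}
               for d in metrics]
    return pd.DataFrame(records).set_index('trial')

The list-of-lists version avoids building a dict per row and may be marginally faster; the dict version cannot get out of sync with its column names.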
Upvotes: 1
Reputation: 131
Manipulating the Series
object passed to apply
is the bottleneck. Creating a new Series
from a dict
can be much faster.
import pandas as pd

metrics = [{'agent': {'fitness': 25.2375,
                      'numerosity': 1,
                      'population': 1,
                      'reliable': 0,
                      'steps': 1,
                      'total_steps': 1,
                      'trial': 0},
            'environment': None,
            'performance': {'was_correct': True}},
           {'agent': {'fitness': 23.975625,
                      'numerosity': 1,
                      'population': 1,
                      'reliable': 0,
                      'steps': 1,
                      'total_steps': 2,
                      'trial': 1},
            'environment': None,
            'performance': {'was_correct': False}}]

thousand_metrics = metrics * 1000
def parse_metrics_to_df(metrics):
    def extract_details(row):
        row['trial'] = row['agent']['trial']
        row['numerosity'] = row['agent']['numerosity']
        row['reliable'] = row['agent']['reliable']
        row['was_correct'] = row['performance']['was_correct']
        return row

    df = pd.DataFrame(metrics)
    df = df.apply(extract_details, axis=1)
    df.drop(['agent', 'environment', 'performance'], axis=1, inplace=True)
    df.set_index('trial', inplace=True)
    return df
%timeit df = parse_metrics_to_df(thousand_metrics)
# 4.06 s ± 87.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
def parse_metrics_to_df2(metrics):
    def extract_details(row):
        # Build a plain dict and return it as a new Series instead of
        # mutating the Series that apply passes in.
        res = {}
        res['trial'] = row['agent']['trial']
        res['numerosity'] = row['agent']['numerosity']
        res['reliable'] = row['agent']['reliable']
        res['was_correct'] = row['performance']['was_correct']
        return pd.Series(res)

    df = pd.DataFrame(metrics)
    df = df.apply(extract_details, axis=1)
    df.set_index('trial', inplace=True)
    return df
df = parse_metrics_to_df(thousand_metrics)
df2 = parse_metrics_to_df2(thousand_metrics)
df.equals(df2)  # True
%timeit df2 = parse_metrics_to_df2(thousand_metrics)
# 566 ms ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Now, 7x faster.
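For completeness: recent pandas versions also provide pd.json_normalize, which flattens nested dicts into dotted column names in a single call. A minimal sketch for this data (relative speed will depend on your pandas version):

import pandas as pd

# Flatten the nested dicts; columns come out as e.g. 'agent.trial'.
df = pd.json_normalize(thousand_metrics)
df = (df[['agent.trial', 'agent.numerosity', 'agent.reliable',
          'performance.was_correct']]
      .rename(columns=lambda c: c.split('.')[-1])
      .set_index('trial'))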
Upvotes: 1