Reputation: 5774
I import the following object from a JSON file:
"loan_numbers":[
{"symbol":1000114, "val":0.1},
{"symbol":1000150, "val":0.15},
{"symbol":1000074, "val":0.11}
]
The above is a list of dicts.
My question is this. If I wanted to search for "symbol" (e.g. 1000150) and return "val" (e.g. 0.15), which method would be quicker:
or
I was going to test both side by side, but was wondering whether there is an accepted Pythonic method, or whether one method is considerably faster than the other under certain conditions (for example, I have a feeling that for longer lists the DataFrame would be faster because of its types).
I have done some searches on Stack Overflow and Google, which show both as viable, but not which is preferred or why.
Upvotes: 3
Views: 982
Reputation: 1591
Considering both of your approaches, I ran a simulation on both functions for list sizes up to 2^25 (~33.5 million records), which is the largest size produced by range(26) in the code below.
At every search, a symbol is randomly chosen from the list using random.choice(), and we then search for that symbol to return the associated val. Note that, as I understand it, you would create the pandas.DataFrame specifically to search through it, which is why I included the creation of the pandas.DataFrame in the function; in other words, it is also timed.
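To make the two lookups concrete before looking at the timings, here is a minimal sketch that applies both to the three records from the question. It is an illustration only and separate from the benchmark: the names target, val_loop and val_df are made up for this example, and next() over a generator is used as a compact equivalent of the explicit loop.

```python
import pandas as pd

# Sample records copied from the question.
loan_dict = {
    "loan_numbers": [
        {"symbol": 1000114, "val": 0.1},
        {"symbol": 1000150, "val": 0.15},
        {"symbol": 1000074, "val": 0.11},
    ]
}

target = 1000150  # hypothetical symbol to look up

# Plain iteration: stop at the first record whose symbol matches.
val_loop = next(
    loan["val"] for loan in loan_dict["loan_numbers"] if loan["symbol"] == target
)

# DataFrame filtering: build the frame, mask on symbol, take the first match.
df = pd.DataFrame(loan_dict["loan_numbers"])
val_df = df.loc[df["symbol"] == target, "val"].iloc[0]

print(val_loop, val_df)
```

Both approaches return the same val for the same symbol; the difference is that the loop can stop at the first match, while the DataFrame version pays for building the frame and comparing every row.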
The results are as follows:
Based on this quick analysis, you would need a list with more than 100 million records before pandas becomes preferable to a simple loop.
The code used is:
import random

import numpy as np
import pandas as pd
import perfplot


def standard_iteration(loan_dict):
    # Pick a random symbol, then scan the list until it is found.
    random_symbol = random.choice(loan_dict['loan_numbers'])['symbol']
    for loan in loan_dict['loan_numbers']:
        if loan['symbol'] == random_symbol:
            return loan['val']


def dataframe_filtering(loan_dict):
    # The DataFrame is built inside the function so its creation is timed too.
    random_symbol = random.choice(loan_dict['loan_numbers'])['symbol']
    df = pd.DataFrame(loan_dict['loan_numbers'])
    return df[df['symbol'] == random_symbol]['val'].iloc[0]


perfplot.show(
    # Build a dict shaped like the question's JSON, with n random records.
    setup=lambda n: {'loan_numbers': [{'symbol': i, 'val': j} for i, j in enumerate(np.random.rand(n))]},
    kernels=[
        standard_iteration,
        dataframe_filtering,
    ],
    labels=["Standard iteration", "Dataframe filtering"],
    n_range=[2 ** k for k in range(26)],
    xlabel="len(a)",
    # Each kernel picks its own random symbol, so outputs are not comparable.
    equality_check=None
)
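As an aside that the benchmark above did not measure: if the same list is searched many times, a common alternative is to build a plain dict keyed by symbol once and look values up in it. The sketch below assumes each symbol appears only once in the list; the name val_by_symbol is hypothetical.

```python
# Sample records copied from the question.
loan_numbers = [
    {"symbol": 1000114, "val": 0.1},
    {"symbol": 1000150, "val": 0.15},
    {"symbol": 1000074, "val": 0.11},
]

# Build the index once, in a single pass over the list.
# Assumption: symbols are unique; with duplicates, the last record wins.
val_by_symbol = {loan["symbol"]: loan["val"] for loan in loan_numbers}

# Each later lookup is a hash lookup rather than a scan of the list.
print(val_by_symbol[1000150])

# .get() returns None instead of raising KeyError for an unknown symbol.
print(val_by_symbol.get(9999999))
```

Building the index costs one full pass, so this only pays off when more than a handful of lookups are made against the same data; for a single search, the simple loop avoids that upfront cost.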
Upvotes: 1