Which is quicker at fetching a result given a search in Python: a list of dicts or a pandas DataFrame?

Question

I import the following object from a JSON file:

"loan_numbers":[
    {"symbol":1000114, "val":0.1},
    {"symbol":1000150, "val":0.15},
    {"symbol":1000074, "val":0.11}
]

The above is a list of dicts.

My question is this: if I wanted to search for a "symbol" (e.g. 1000150) and return its "val" (e.g. 0.15), which method would be quicker?

[method 1] iterating through the list of dicts (for i in loan_numbers: if i['symbol'] == 1000150: v = i['val'])

or

[method 2] populating a pandas DataFrame and searching it (df.loc[df['symbol'] == 1000150, 'val'])
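For concreteness, the two methods can be sketched as runnable code. This is only an illustration: it assumes the JSON has already been loaded into a variable named loan_numbers, and the helper names find_val_loop and find_val_pandas are made up for this example.

```python
import pandas as pd

# Sample data from the question, assumed to be already loaded from the JSON file.
loan_numbers = [
    {"symbol": 1000114, "val": 0.1},
    {"symbol": 1000150, "val": 0.15},
    {"symbol": 1000074, "val": 0.11},
]


def find_val_loop(records, symbol):
    """Method 1: scan the list and return the first matching 'val' (None if absent)."""
    for record in records:
        if record["symbol"] == symbol:
            return record["val"]
    return None


def find_val_pandas(df, symbol):
    """Method 2: boolean-mask lookup on a DataFrame (None if absent)."""
    matches = df.loc[df["symbol"] == symbol, "val"]
    return matches.iloc[0] if not matches.empty else None


# Hypothetical usage: the DataFrame is built once, then searched.
df = pd.DataFrame(loan_numbers)
print(find_val_loop(loan_numbers, 1000150))
print(find_val_pandas(df, 1000150))
```

Both helpers return the first match only, which assumes each symbol appears at most once in the list.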

I was going to test both side by side, but I was wondering whether there is an accepted Pythonic method, or whether one method is considerably faster than the other under certain conditions (for example, I have a feeling that for longer lists the DataFrame would be faster because of its types).

I have done some searches on Stack Overflow and Google, which show both as viable, but not which is preferred or why.

arhr · Accepted Answer

Considering both of your approaches, I benchmarked the two functions for list sizes up to 2^25 (~33.5 million records). For every search, a symbol is chosen at random from the list using random.choice(), and we then search for that symbol to return the associated val. Note that, as I understand it, you would create the pandas.DataFrame specifically to search through it, so I included the creation of the pandas.DataFrame in the function; in other words, it is also timed.

The results are as follows:

Based on this quick analysis, you would need a list with more than 100 million records before pandas becomes preferable to a simple loop.

The code used is:

import random

import perfplot
import numpy as np
import pandas as pd

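def standard_iteration(loan_dict):
    # Pick a random symbol that is guaranteed to exist in the list.
    random_symbol = random.choice(loan_dict['loan_numbers'])['symbol']
    # Linear scan: return the val of the first matching record.
    for loan in loan_dict['loan_numbers']:
        if loan['symbol'] == random_symbol:
            return loan['val']

def dataframe_filtering(loan_dict):
    random_symbol = random.choice(loan_dict['loan_numbers'])['symbol']
    # The DataFrame is built inside the function, so its creation is timed too.
    df = pd.DataFrame(loan_dict['loan_numbers'])
    return df[df['symbol'] == random_symbol]['val'].iloc[0]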

perfplot.show(
    # Build a dict holding a list of n records with sequential symbols and random vals.
    setup=lambda n: {'loan_numbers': [{'symbol': i, 'val': j} for i, j in enumerate(np.random.rand(n))]},
    kernels=[
        standard_iteration,
        dataframe_filtering,
    ],
    labels=["Standard iteration", "Dataframe filtering"],
    n_range=[2 ** k for k in range(26)],  # list sizes 2^0 .. 2^25
    xlabel="len(loan_numbers)",
    equality_check=None  # each kernel picks its own random symbol, so results differ
)
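Because the benchmark above times the DataFrame creation together with the lookup, it can be useful to measure the two separately. The following is a minimal sketch using the standard-library timeit module. The function names (make_records, filter_only, compare) and the default sizes are assumptions made for this illustration and are not part of the benchmark above, and no particular outcome is implied.

```python
import random
import timeit

import pandas as pd


def make_records(n):
    """Build n records shaped like the question's data (vals are random)."""
    return [{"symbol": i, "val": random.random()} for i in range(n)]


def loop_lookup(records, symbol):
    """Linear scan over the list of dicts."""
    for loan in records:
        if loan["symbol"] == symbol:
            return loan["val"]
    return None


def build_and_filter(records, symbol):
    """DataFrame creation is included in the timed work."""
    df = pd.DataFrame(records)
    return df.loc[df["symbol"] == symbol, "val"].iloc[0]


def filter_only(df, symbol):
    """Assumes the DataFrame already exists, so only the lookup is timed."""
    return df.loc[df["symbol"] == symbol, "val"].iloc[0]


def compare(n=100_000, repeats=5):
    """Return the average seconds per call for each approach."""
    records = make_records(n)
    df = pd.DataFrame(records)  # built once, outside the timed calls
    symbol = random.choice(records)["symbol"]
    candidates = {
        "loop": lambda: loop_lookup(records, symbol),
        "build + filter": lambda: build_and_filter(records, symbol),
        "filter only": lambda: filter_only(df, symbol),
    }
    return {
        name: timeit.timeit(func, number=repeats) / repeats
        for name, func in candidates.items()
    }


if __name__ == "__main__":
    for name, seconds in compare().items():
        print(f"{name}: {seconds:.6f} s per lookup")
```

If the same data is searched many times, the "filter only" figure is the relevant one; if the DataFrame would be created just for a single search, "build + filter" matches the scenario timed above.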
