darren

Reputation: 5774

Which is quicker at fetching a result for a given search in Python: a list of dicts or a pandas DataFrame?

I import, from a json file, the following object:

"loan_numbers":[
    {"symbol":1000114, "val":0.1},
    {"symbol":1000150, "val":0.15},
    {"symbol":1000074, "val":0.11}
]    

The above is a list of dicts.
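As a minimal sketch of how such an object ends up in Python (assuming the fragment above sits inside a top-level JSON object; the inline string here is only a stand-in for the file), `json.loads` gives back a plain list of dicts:

```python
import json

# Inline stand-in for the JSON file's contents (assumption: the fragment
# shown above is wrapped in a top-level object).
raw = """
{
  "loan_numbers": [
    {"symbol": 1000114, "val": 0.1},
    {"symbol": 1000150, "val": 0.15},
    {"symbol": 1000074, "val": 0.11}
  ]
}
"""

data = json.loads(raw)
loan_numbers = data["loan_numbers"]

# JSON arrays become lists and JSON objects become dicts.
print(type(loan_numbers).__name__)     # list
print(type(loan_numbers[0]).__name__)  # dict
```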

My question is this: if I wanted to search for a "symbol" (e.g. 1000150) and return its "val" (e.g. 0.15), which method would be quicker:

or

I was going to test both side by side, but was wondering whether there is an accepted Pythonic method, or whether one method is considerably faster than the other under certain conditions (for example, I have a feeling that for longer lists the DataFrame would be faster because of its typed columns).

I have searched Stack Overflow and Google; the results show both approaches as viable, but not which one is preferred or why.

Upvotes: 3

Views: 982

Answers (1)

arhr

Reputation: 1591

To compare your two approaches, I benchmarked both functions on list sizes up to 2^25 (~33.5 million records). On each run, a symbol is picked at random from the list with random.choice(), and the function then searches for that symbol and returns the associated val. Note that, as I understand it, you would be creating the pandas.DataFrame specifically to search through it, so I included the DataFrame construction inside the function; in other words, it is timed as well.

The results are as follows: [plot: perfplot results]

Based on this quick analysis, extrapolating the trend suggests you would need a list with more than about 100 million records before pandas becomes worthwhile over a simple loop.

The code used is:

import random

import perfplot
import numpy as np
import pandas as pd

def standard_iteration(loan_dict):
    # Pick a symbol that is guaranteed to exist, then scan linearly for it
    random_symbol = random.choice(loan_dict['loan_numbers'])['symbol']
    for loan in loan_dict['loan_numbers']:
        if loan['symbol'] == random_symbol:
            return loan['val']

def dataframe_filtering(loan_dict):
    random_symbol = random.choice(loan_dict['loan_numbers'])['symbol']
    # DataFrame construction is deliberately inside the timed function
    df = pd.DataFrame(loan_dict['loan_numbers'])
    # Boolean-mask the rows, then take the first matching val
    return df[df['symbol'] == random_symbol]['val'].iloc[0]

perfplot.show(
    # Build a list of n dicts with unique integer symbols and random vals
    setup=lambda n: {
        'loan_numbers': [{'symbol': i, 'val': j} for i, j in enumerate(np.random.rand(n))]
    },
    kernels=[
        standard_iteration,
        dataframe_filtering,
    ],
    labels=["Standard iteration", "Dataframe filtering"],
    n_range=[2 ** k for k in range(26)],  # 1 to 2^25 records
    xlabel="len(loan_numbers)",
    equality_check=None  # each kernel picks its own random symbol, so outputs differ
)
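
A side note on the benchmark above: both kernels scan the data on every call. If many lookups are made against the same list, one common alternative is to build a dict keyed by symbol once and reuse it, since dict lookups take constant time on average. The sketch below is not part of the timings above; build_index and lookup are hypothetical helper names, and it assumes symbols are unique.

```python
def build_index(loan_numbers):
    """Map each symbol to its val; built once, reused for every lookup."""
    # Assumption: symbols are unique. With duplicates, the last one wins.
    return {loan["symbol"]: loan["val"] for loan in loan_numbers}


def lookup(index, symbol, default=None):
    """Return the val for symbol, or default if the symbol is absent."""
    return index.get(symbol, default)


loan_numbers = [
    {"symbol": 1000114, "val": 0.1},
    {"symbol": 1000150, "val": 0.15},
    {"symbol": 1000074, "val": 0.11},
]

index = build_index(loan_numbers)
print(lookup(index, 1000150))  # 0.15
print(lookup(index, 42))       # None
```

Building the index costs one full pass over the list, so this only pays off when the same data is searched more than once.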

Upvotes: 1
