Jonas Heidelberg

Reputation: 5024

Use "dict of dicts" or "list of dicts" for storing data from CSV in python?

What is the better option in Python?

  1. A dict with 10000 keys, each containing a list with 10 items
  2. A dict with 10000 keys, each containing a dict with 10 "sub-keys"
  3. A dict with the 10 "sub-keys", each containing a dict with the 10000 keys
  4. Using pandas library (as suggested by @RomanPerekhrest in the comments)

Option 2 seems a bit more programmer-friendly than option 1 (working with e.g. mydict['long-ID1']['street'] rather than mydict['long-ID1'][3]).
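For illustration, the two access styles compare like this (hypothetical station data; the field names are just examples):

```python
# Option 1: dict of lists -- fields addressed by position
stations_list = {
    "long-ID1": ["Main Station", "Station Square 1", "Berlin"],
}
city = stations_list["long-ID1"][2]  # must remember that index 2 is the city

# Option 2: dict of dicts -- fields addressed by name
stations_dict = {
    "long-ID1": {"name": "Main Station",
                 "street": "Station Square 1",
                 "city": "Berlin"},
}
city = stations_dict["long-ID1"]["city"]  # self-documenting lookup
```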

However, I am afraid that this might cause much unneeded overhead. I don't expect the number or order of sub-keys (like 'street') to change in the future.

I am looking for the "best" option in terms of performance (speed of lookup), while also considering storage space (in RAM and when saving with pickle).

Background

I am parsing a ~ 4MB CSV file with ~10000 lines (stations) with these columns:

ID - unique ~30 character string
name, street, city, ... - strings
lat,long - GPS coordinates
date - take a guess
jsonstring - some nested dicts

I want to import the data into python as a dict station using ID as the key to allow fast lookups station['some-id']. I will then perform several million lookups in the dict, usually only looking at 1-2 of the 10 columns for each station, depending on use case.
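A minimal sketch of that import step using the standard csv module (the column names and sample rows here are assumptions; the real file has ~10000 rows and more columns):

```python
import csv
import io

# Hypothetical CSV content standing in for the real ~4MB file.
csv_text = """ID,name,street,city
abc-123,Main Station,Station Square 1,Berlin
def-456,Harbour Stop,Pier Road 7,Hamburg
"""

station = {}
reader = csv.DictReader(io.StringIO(csv_text))
for row in reader:
    station[row["ID"]] = row  # dict of dicts keyed by the unique ID

print(station["abc-123"]["city"])  # lookup by ID, then by column name
```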

The latter is why, while writing this question, I thought of option 3... the downside I see is that the 10000 keys are much longer than the 10 keys, so repeating that large dict 10 times is probably not such a good idea in terms of memory?
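For clarity, option 3 would invert the nesting like this (again with made-up data):

```python
# Option 3: outer dict keyed by column name, inner dicts keyed by station ID.
columns = {
    "city":   {"long-ID1": "Berlin", "long-ID2": "Hamburg"},
    "street": {"long-ID1": "Station Square 1", "long-ID2": "Pier Road 7"},
}
city = columns["city"]["long-ID1"]  # column first, then station ID
```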

**Update:** Based on @Giova's answer, I put together this performance comparison of options 1 and 2, which you can also find on repl.it here:

from timeit import timeit

def listops(n:int, l):
    for i in range(n):
        t = l[i]
    return l

def dictops(n:int, d):
    for i in range(n):
        t = d["nice_key_%1d"%i]
    return d


n = 10
l = list(range(n))
d = {"nice_key_%1d" % i: None for i in range(n)}
t1 = timeit(lambda: listops(n, l), number=1000000)
t2 = timeit(lambda: dictops(n, d), number=1000000)
print("list:", t1)
print("dict:", t2)

Note that this times only the speed of lookup, in contrast to Giova's code snippet, which also measures creation of the data structure, which I am not interested in here.

Results:

list: 4.312716690998059
dict: 14.279126501001883

I guess the simplest way to answer the "storage space" part of my question is actually to implement both and check the size of the pickle files?
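For example, a rough size check on toy data might look like this (the real field values would of course differ, so no expected byte counts are given):

```python
import pickle

n = 10000
ids = ["long-ID-%05d" % i for i in range(n)]

# Option 1: values are lists; Option 2: values are dicts with named fields.
as_lists = {i: ["name", "street", "city"] for i in ids}
as_dicts = {i: {"name": "name", "street": "street", "city": "city"}
            for i in ids}

size_lists = len(pickle.dumps(as_lists))
size_dicts = len(pickle.dumps(as_dicts))
print("list values:", size_lists, "bytes")
print("dict values:", size_dicts, "bytes")
```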

Upvotes: 0

Views: 1632

Answers (1)

Giova

Reputation: 2005

Among the options you mentioned, the ones using dictionaries (2, 3) are better in terms of access speed. Which of 2 and 3 is better really depends on how you prefer to retrieve elements.

Also consider that you can associate a progressive number with each row and use it as the key of the first dictionary; reasonably supposing that strings consume more memory than small numbers, that trick would waste less memory.
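A rough illustration of that point with sys.getsizeof (exact byte counts vary by Python version and platform, so no numbers are shown):

```python
import sys

long_id = "a" * 30   # a ~30 character station ID, as in the question
small_int = 42       # a progressive row number

# The string object is noticeably larger than the small int object.
print("string key:", sys.getsizeof(long_id), "bytes")
print("int key:   ", sys.getsizeof(small_int), "bytes")
```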

Option 1 should only be considered if speed is not that important, because of the significant amount of time spent in list methods. This can easily be verified empirically with a simple snippet:

from elapsed import TimeIt


def listops(n:int):
    l = []
    for i in range(n):
        l.append(i)
    for i in range(n):
        t = i in l
    return l


def dictops(n:int):
    d = dict()
    for i in range(n):
        d[i] = None
    for i in range(n):
        t = i in d
    return d

TimeIt(lambda: listops(10), 1000000, logger_name=__name__, msg='listops(10)')
TimeIt(lambda: dictops(10), 1000000, logger_name=__name__, msg='dictops(10)')

Here we have two functions, listops and dictops, which respectively create a list and a dict of n integers and then check for the presence of each inserted integer. The code basically checks only construction, insertion, and presence testing for list and dict. It logs the following timings:

Elapsed time '1000000 times listops(10)':  3.559521 seconds.
Elapsed time '1000000 times dictops(10)':  2.720709 seconds.

[Note that TimeIt (camel case) is a custom class I wrote for my own purposes, but you can easily use timeit from the timeit module instead.]

Even if it is not explicitly part of your question, I would advise using an sqlite in-memory database, with indexes on the columns used in search queries and a prepared statement ready to be fired on demand.
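A minimal sketch of that suggestion with the standard sqlite3 module (the table layout and sample rows are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE station (id TEXT PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO station VALUES (?, ?, ?)",
    [("abc-123", "Main Station", "Berlin"),
     ("def-456", "Harbour Stop", "Hamburg")],
)
# PRIMARY KEY creates an index on id; the parameterized query is
# prepared once and can be fired on demand for each lookup.
row = conn.execute("SELECT city FROM station WHERE id = ?",
                   ("abc-123",)).fetchone()
print(row[0])
```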

Upvotes: 2
