Jonas Heidelberg

Reputation: 5024

Use "dict of dicts" or "list of dicts" for storing data from CSV in python?

What is the better option in Python?

  1. A dict with 10000 keys, each containing a list with 10 items
  2. A dict with 10000 keys, each containing a dict with 10 "sub-keys"
  3. A dict with the 10 "sub-keys", each containing a dict with the 10000 keys
  4. Using pandas library (as suggested by @RomanPerekhrest in the comments)

Option 2 seems a bit more programmer-friendly than option 1 (working with e.g. mydict['long-ID1']['street'] rather than mydict['long-ID1'][3]).
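For illustration, the two access styles compare like this (hypothetical station data; the field names are just examples):

```python
# Option 1: dict of lists -- fields addressed by position
stations_list = {
    "long-ID1": ["Main Station", "Station Square 1", "Berlin"],
}
city = stations_list["long-ID1"][2]  # must remember that index 2 is the city

# Option 2: dict of dicts -- fields addressed by name
stations_dict = {
    "long-ID1": {"name": "Main Station",
                 "street": "Station Square 1",
                 "city": "Berlin"},
}
city = stations_dict["long-ID1"]["city"]  # self-documenting lookup
```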

However, I am afraid that this might cause much unneeded overhead. I don't expect the number or order of sub-keys (like 'street') to change in the future.

I am looking for the "best" option in terms of performance (speed of lookup), while also considering storage space (in RAM and when saving with pickle).

Background

I am parsing a ~ 4MB CSV file with ~10000 lines (stations) with these columns:

ID - unique ~30 character string
name, street, city, ... - strings
lat,long - GPS coordinates
date - take a guess
jsonstring - some nested dicts

I want to import the data into python as a dict station using ID as the key to allow fast lookups station['some-id']. I will then perform several million lookups in the dict, usually only looking at 1-2 of the 10 columns for each station, depending on use case.
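A minimal sketch of that import step using the standard csv module (the column names and sample rows here are assumptions; the real file has ~10000 rows and more columns):

```python
import csv
import io

# Hypothetical CSV content standing in for the real ~4MB file.
csv_text = """ID,name,street,city
abc-123,Main Station,Station Square 1,Berlin
def-456,Harbour Stop,Pier Road 7,Hamburg
"""

station = {}
reader = csv.DictReader(io.StringIO(csv_text))
for row in reader:
    station[row["ID"]] = row  # dict of dicts keyed by the unique ID

print(station["abc-123"]["city"])  # lookup by ID, then by column name
```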

The latter is why, while writing this question, I thought of option 3... the downside I see is that the 10000 keys are much longer than the 10 keys, so repeating that large dict 10 times is probably not such a good idea in terms of memory?
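For clarity, option 3 would invert the nesting like this (again with made-up data):

```python
# Option 3: outer dict keyed by column name, inner dicts keyed by station ID.
columns = {
    "city":   {"long-ID1": "Berlin", "long-ID2": "Hamburg"},
    "street": {"long-ID1": "Station Square 1", "long-ID2": "Pier Road 7"},
}
city = columns["city"]["long-ID1"]  # column first, then station ID
```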

**Update:** Based on @Giova's answer, I put together this performance comparison of options 1 and 2, which you can also find on repl.it here:

from timeit import timeit

def listops(n:int, l):
    for i in range(n):
        t = l[i]
    return l

def dictops(n:int, d):
    for i in range(n):
        t = d["nice_key_%1d"%i]
    return d


n = 10
l = list(range(n))
d = {"nice_key_%1d" % i: None for i in range(n)}
t1 = timeit(lambda: listops(n, l), number=1000000)
t2 = timeit(lambda: dictops(n, d), number=1000000)
print("list:", t1)
print("dict:", t2)

Note that this times only the speed of lookup, in contrast to Giova's code snippet, which also measures creation of the data structure, which I am not interested in here.

Results:

list: 4.312716690998059
dict: 14.279126501001883

I guess the simplest way to answer the "storage space" part of my question is actually to implement both and check the size of the pickle files?
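For example, a rough size check on toy data might look like this (the real field values would of course differ, so no expected byte counts are given):

```python
import pickle

n = 10000
ids = ["long-ID-%05d" % i for i in range(n)]

# Option 1: values are lists; Option 2: values are dicts with named fields.
as_lists = {i: ["name", "street", "city"] for i in ids}
as_dicts = {i: {"name": "name", "street": "street", "city": "city"}
            for i in ids}

size_lists = len(pickle.dumps(as_lists))
size_dicts = len(pickle.dumps(as_dicts))
print("list values:", size_lists, "bytes")
print("dict values:", size_dicts, "bytes")
```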

Upvotes: 0

Views: 1632

Answers (1)

Giova

Reputation: 2005

Among the options you mentioned, the ones using dictionaries (2, 3) are better in terms of access speed. Which of 2 and 3 is better really depends on how you prefer to retrieve elements.

Also consider that you can associate a progressive number with each row and use it as the key of the first dictionary; reasonably supposing that strings consume more memory than small numbers, that trick would waste less memory.
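A rough illustration of that point with sys.getsizeof (exact byte counts vary by Python version and platform, so no numbers are shown):

```python
import sys

long_id = "a" * 30   # a ~30 character station ID, as in the question
small_int = 42       # a progressive row number

# The string object is noticeably larger than the small int object.
print("string key:", sys.getsizeof(long_id), "bytes")
print("int key:   ", sys.getsizeof(small_int), "bytes")
```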

Option 1 should only be considered if speed is not that important, because of the significant amount of time spent in list methods. This can easily be verified empirically with a simple snippet:

from elapsed import TimeIt


def listops(n:int):
    l = []
    for i in range(n):
        l.append(i)
    for i in range(n):
        t = i in l
    return l


def dictops(n:int):
    d = dict()
    for i in range(n):
        d[i] = None
    for i in range(n):
        t = i in d
    return d

TimeIt(lambda: listops(10), 1000000, logger_name=__name__, msg='listops(10)')
TimeIt(lambda: dictops(10), 1000000, logger_name=__name__, msg='dictops(10)')

Here we have two functions, listops and dictops, which respectively create a list and a dict of n integers and then check for the presence of each inserted integer. The code basically checks only construction, insertion, and presence testing for list and dict. It logs the following timings:

Elapsed time '1000000 times listops(10)':  3.559521 seconds.
Elapsed time '1000000 times dictops(10)':  2.720709 seconds.

[Note that TimeIt (camel case) is a custom class I wrote for my own purposes, but you can easily use timeit from the timeit module instead.]

Even if it is not explicitly part of your question, I would advise using an sqlite in-memory database, with indexes on the columns used in search queries and a prepared statement ready to be fired on demand.
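A minimal sketch of that suggestion with the standard sqlite3 module (the table layout and sample rows are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE station (id TEXT PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO station VALUES (?, ?, ?)",
    [("abc-123", "Main Station", "Berlin"),
     ("def-456", "Harbour Stop", "Hamburg")],
)
# PRIMARY KEY creates an index on id; the parameterized query is
# prepared once and can be fired on demand for each lookup.
row = conn.execute("SELECT city FROM station WHERE id = ?",
                   ("abc-123",)).fetchone()
print(row[0])
```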

Upvotes: 2
