Reputation: 5024
Which is the best option in Python?
1. dict with 10000 keys, each containing a list with 10 items
2. dict with 10000 keys, each containing a dict with 10 "sub-keys"
3. dict with the 10 "sub-keys", each containing a dict with the 10000 keys

Option 2 seems a bit more programmer-friendly than option 1 (working with e.g. mydict['long-ID1']['street'] rather than mydict['long-ID1'][3]).
However, I am afraid that this might cause a lot of unneeded overhead. I don't expect the number or order of sub-keys (like 'street') to change in the future.
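For illustration, here is a minimal sketch of the three layouts, with one station and two of the ten fields (the values are made up):

# option 1: ID -> list, fields addressed by position
opt1 = {'long-ID1': ['name1', 'street1']}
# option 2: ID -> dict, fields addressed by name
opt2 = {'long-ID1': {'name': 'name1', 'street': 'street1'}}
# option 3: field name -> dict keyed by the station IDs
opt3 = {'name':   {'long-ID1': 'name1'},
        'street': {'long-ID1': 'street1'}}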
I am looking for the "best" option in terms of performance (speed of lookup), while also considering storage space (in RAM and when saving with pickle).
Background
I am parsing a ~4 MB CSV file with ~10000 lines (stations) with these columns:
ID - unique ~30 character string
name, street, city, ... - strings
lat, long - GPS coordinates
date - take a guess
jsonstring - some nested dicts
I want to import the data into Python as a dict station, using ID as the key to allow fast lookups like station['some-id']. I will then perform several million lookups in the dict, usually only looking at 1-2 of the 10 columns for each station, depending on the use case.
The latter is why, while writing this question, I thought of option 3... the downside I see is that the 10000 keys are much longer than the 10 keys, so repeating that large dict 10 times is probably not such a good idea in terms of memory?
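For reference, here is a minimal sketch of the import for option 2, assuming the CSV has a header row with the column names listed above and is called stations.csv (both assumptions):

import csv
import json

station = {}
with open('stations.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        # one small dict of named fields per station, keyed by the long ID
        station[row['ID']] = {
            'name': row['name'],
            'street': row['street'],
            'city': row['city'],
            'lat': float(row['lat']),
            'long': float(row['long']),
            'date': row['date'],
            'json': json.loads(row['jsonstring']),
        }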
**Update**: Based on @Giova's answer, I put together this performance comparison of options 1 and 2, which you can also find on repl.it here:
from timeit import timeit

def listops(n: int, l):
    # look up each of the n items by integer index
    for i in range(n):
        t = l[i]
    return l

def dictops(n: int, d):
    # look up each of the n items by string key
    # (note: this also times the formatting of the key itself)
    for i in range(n):
        t = d["nice_key_%1d" % i]
    return d

n = 10
l = []
for i in range(n):
    l.append(i)
d = dict()
for i in range(n):
    d["nice_key_%1d" % i] = None

t1 = timeit(lambda: listops(n, l), number=1000000)
t2 = timeit(lambda: dictops(n, d), number=1000000)
print("list:", t1)
print("dict:", t2)
Note that this times only the speed of lookup, in contrast to Giova's snippet, which also measures creation of the data structure, something I am not interested in.
Results:
list: 4.312716690998059
dict: 14.279126501001883
I guess the simplest way to answer the "storage space" part of my question is to actually implement both and check the size of the pickle files?
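A minimal sketch of that check with dummy data (the field names and placeholder values are assumptions):

import pickle

ids = ['long-ID-%05d' % i for i in range(10000)]
fields = ['name', 'street', 'city', 'lat', 'long',
          'date', 'json', 'f8', 'f9', 'f10']

# option 1: ID -> list of 10 values
opt1 = {i: ['x'] * 10 for i in ids}
# option 2: ID -> dict of 10 named fields
opt2 = {i: {f: 'x' for f in fields} for i in ids}

for label, obj in (('option 1', opt1), ('option 2', opt2)):
    print(label, len(pickle.dumps(obj)), 'bytes pickled')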
Upvotes: 0
Views: 1632
Reputation: 2005
Among the options you mentioned, the ones using dictionaries (2, 3) are better in terms of access speed. Which one is better between 2 and 3 really depends on how you prefer to retrieve elements.
Consider also that you can associate a progressive number with each row and use it as the key to the first dictionary; reasonably supposing that strings consume more memory than small numbers, that trick would waste less memory.
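A minimal sketch of that trick (the names and values are mine, for illustration):

# map the long string ID to a small progressive number once...
id_to_row = {'long-ID1': 0, 'long-ID2': 1}
# ...and key the station data by that number instead
rows = {0: {'street': '1 Main St'}, 1: {'street': '2 Oak Ave'}}

print(rows[id_to_row['long-ID1']]['street'])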
Option 1 should only be considered if speed is not that important, because of the significant amount of time spent by list methods. That can even be verified empirically with a simple snippet:
from elapsed import TimeIt

def listops(n: int):
    l = []
    for i in range(n):
        l.append(i)
    for i in range(n):
        t = i in l
    return l

def dictops(n: int):
    d = dict()
    for i in range(n):
        d[i] = None
    for i in range(n):
        t = i in d
    return d

TimeIt(lambda: listops(10), 1000000, logger_name=__name__, msg='listops(10)')
TimeIt(lambda: dictops(10), 1000000, logger_name=__name__, msg='dictops(10)')
Here we have two functions, listops and dictops, creating respectively a list and a dict of n given integers and then checking for the presence of each inserted integer. The code is basically meant to check only construction, insertion, and presence testing for list and dict. This logs the following timings:
Elapsed time '1000000 times listops(10)': 3.559521 seconds.
Elapsed time '1000000 times dictops(10)': 2.720709 seconds.
[Note that TimeIt (camel case) is a custom class I wrote for my own purposes, but you can easily use timeit from the timeit module.]
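For instance, a rough equivalent with the standard module (my sketch, not the original TimeIt class):

from timeit import timeit

print('listops(10):', timeit(lambda: listops(10), number=1000000))
print('dictops(10):', timeit(lambda: dictops(10), number=1000000))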
Even if it is not explicitly part of your question, I would advise using an sqlite in-memory database, with indexes on the columns used in search queries and prepared statements ready to be fired on demand.
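A minimal sketch of that approach with Python's built-in sqlite3 module (the table layout and sample values are assumptions based on the question):

import sqlite3

con = sqlite3.connect(':memory:')
con.execute("""CREATE TABLE stations (
    id TEXT PRIMARY KEY,          -- the ~30-character unique ID
    name TEXT, street TEXT, city TEXT,
    lat REAL, long REAL)""")
con.execute('CREATE INDEX idx_city ON stations (city)')
con.execute('INSERT INTO stations VALUES (?, ?, ?, ?, ?, ?)',
            ('long-ID1', 'name1', '1 Main St', 'Springfield', 48.1, 11.6))
# parameterized queries act as prepared statements; sqlite3 caches them
street = con.execute('SELECT street FROM stations WHERE id = ?',
                     ('long-ID1',)).fetchone()[0]
print(street)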
Upvotes: 2