Reputation: 684
I have a large python dictionary (65535 key:value pairs), where key is range(0, 65536) and the values are integers.
The solution I found to sorting this data structure is posted here: Sort a Python dictionary by value
That solution works, but is not necessarily very fast.
To further complicate the problem, there is a potential for me to have many (thousands) of these dictionaries that I must combine prior to sorting. I am currently combining these dictionaries by iterating over the pairs in one dictionary, doing a key lookup in the other dictionary, and adding/updating the entry as appropriate.
This makes my question two fold:
1)Is a dictionary the right data structure for this problem? Would a custom tree or something else make more sense?
2)If dictionary is the smart, reasonable choice, what is the ideal way to combine multiples of the dictionary and then sort it?
One solution may be for me to redesign my program's flow in order to decrease the number of dictionaries being maintained to one, though this is more of a last resort.
Thanks
Upvotes: 0
Views: 496
Reputation: 32392
A dictionary populated with 65535 entries with keys from the range (0:65536) sounds suspiciously like an array. If you need sorted arrays, why are you using dictionaries?
Normally, in Python, you would use a list for this type of data. In your case, since the values are integers, you might also want to consider using the array module. You should also have a look at the heapq module since if your data can be represented in this way, there is a builtin merge function that could be used.
In any case, if you need to merge data structures and produce a sorted data structure as a result, it is best to to use a merge algorithm and one possibility for that is a mergesort algorithm.
Upvotes: 2
Reputation: 2609
With that quantity of data, I would bite the bullet and use the built in sqlite module. Yes you give up some python flexibility and have to use SQL, but right now its sorting 65k values; next it will be find values that meet certain criteria. So instead of reinventing relational databases, just go the SQL route now.
Upvotes: 0
Reputation: 57474
There's not enough information here to say which data structure you should use, because we don't know what else you're doing with it.
If you need to be able to quickly insert records into the data structure later one at a time, then you do need a tree-like data structure, which unfortunately doesn't have a standard implementation (or even a standard interface, for some operations) in Python.
If you only need to be able to do what you said--to sort existing data--then you can use lists. The sorting is quick, especially if parts of the data are already sorted, and you can use binary searching for fast lookups. However, inserting elements will be O(n) rather than the O(log n) you'll get with a tree.
Here's a simple example, converting the dicts to a list or tuples, sorting the combined results and using the bisect module to search for items.
Note that you can have duplicate keys, showing up in more than one dict. This is handled easily: they'll be sorted together naturally, and the bisection will give you a [start, end) range containing all of those keys.
If you want to add blocks of data later, append it to the end and re-sort the list; Python's sorting is good at that and it'll probably be much better than O(n log n).
This code assumes your keys are integers, as you said.
dataA = { 1: 'data1', 3: 'data3', 5: 'data5', 2: 'data2' }
dataB = { 2: 'more data2', 4: 'data4', 6: 'data6' }
combined_list = dataA.items() + dataB.items()
combined_list.sort()
print combined_list
import bisect
def get_range(data, value):
lower_bound = bisect.bisect_left(data, (value, ))
upper_bound = bisect.bisect_left(data, (value+1, ))
return lower_bound, upper_bound
lower_bound, upper_bound = get_range(combined_list, 2)
print lower_bound, upper_bound
print combined_list[lower_bound:upper_bound]
Upvotes: 0