Reputation: 43476
I have a list of dicts, something like this:
test_data = [
{ 'offset':0, 'data':1500 },
{ 'offset':1270, 'data':120 },
{ 'offset':2117, 'data':30 },
{ 'offset':4055, 'data':30000 },
]
The dict items are sorted in the list according to the 'offset'
data. The real data could be much longer.
What I want to do is lookup an item in the list given a particular offset value, which is not exactly one of those values, but in that range. So, a binary search is what I want to do.
I am now aware of the Python bisect
module, which is a ready-made binary search—great, but not directly usable for this case. I'm just wondering what is the easiest way to adapt bisect
to my needs. Here is what I came up with:
import bisect
class dict_list_index_get_member(object):
def __init__(self, dict_list, member):
self.dict_list = dict_list
self.member = member
def __getitem__(self, index):
return self.dict_list[index][self.member]
def __len__(self):
return self.dict_list.__len__()
test_data_index_get_offset = dict_list_index_get_member(test_data, 'offset')
print bisect.bisect(test_data_index_get_offset, 1900)
It prints:
2
My question is, is this the best way to do what I want, or is there some other simpler, better way?
Upvotes: 15
Views: 7448
Reputation: 1365
For range queries over a list of dicts, ducks will perform well. It's as fast as a binary search because it builds a tree-based index.
pip install ducks
from ducks import Dex
test_data = [
{ 'offset':0, 'data':1500 },
{ 'offset':1270, 'data':120 },
{ 'offset':2117, 'data':30 },
{ 'offset':4055, 'data':30000 },
]
# build index on 'offset'
dex = Dex(test_data, ['offset'])
dex[{'offset': {'>': 1900}}]
# result: [{'offset': 2117, 'data': 30}, {'offset': 4055, 'data': 30000}]
Ducks can also search by multiple attributes, e.g.:
# build a Dex on 'offset' and 'data'
dex = Dex(test_data, ['offset', 'data'])
dex[{'offset': {'>': 1900}, 'data': {'<': 50}}]
# result: [{'offset': 2117, 'data': 30}]
Upvotes: 0
Reputation: 7164
Starting in Python 3.10 you can pass a key function as a keyword argument to the bisect functions
>>> bisect.bisect(test_data, 1900, key=lambda x: x["offset"])
2
Upvotes: 1
Reputation: 436
tuples work with bisect if you are ok using them instead...
import bisect
offset = 0
data = 1
test_data = [
(0, 1500),
(1270, 120),
(2117, 30),
(4055, 30000),
]
i = bisect.bisect(test_data, (1900,0))
test_data.insert(i, (1900,0))
print(test_data[i][data])
although because tuples are compared "lexicographically" (left to right) until an element is not equal to the other - you would have to consider if this is desired behaviour
>>> bisect.insort(test_data, (2117,29))
>>> print(test_data)
[(0, 1500), (1270, 120), (2117, 29), (2117, 30), (4055, 30000)]
Upvotes: 0
Reputation: 8729
You could also use one of Python's many SortedDict implementations to manage your test_data. A sorted dict sorts the elements by key and maintains a mapping to a value. Some implementations also support a bisect operation on the keys. For example, the Python sortedcontainers module has a SortedDict that meets your requirements.
In your case it would look something like:
from sortedcontainers import SortedDict
offset_map = SortedDict((item['offset'], item['data']) for item in test_data)
index = offset_map.bisect(1275)
key = offset_map.iloc[index]
print offset_map[key]
# 120
The SortedDict type has a bisect function which returns the bisected index of the desired key. With that index, you can lookup the actual key. And with that key you can get the value.
All of these operations are very fast in sortedcontainers which is also conveniently implemented in pure-Python. There's a performance comparison too which discusses other choices and has benchmark data.
Upvotes: 9
Reputation: 392010
What you can do is this
class OffsetWithAttributes( object ):
def __init__( self, offset, **kw ):
self.offset= offset
self.attributes= kw
def __eq__( self, other ):
return self.offset == other.offset
def __lt__( self, other ):
return self.offset < other.offset
def __le__( self, other ):
return self.offset <= other.offset
def __gt__( self, other ):
return self.offset > other.offset
def __ge__( self, other ):
return self.offset >= other.offset
def __ne__( self, other ):
return self.offset != other.offset
This should allow you to create a simple list
of OffsetWithAttributes
instances. The bisect
algorithm should be perfectly happy to use the defined operators.
You can use your someOWA.attributes['data']
.
Or
def __getattr__( self, key ):
return self.attributes[key]
That should make the OffsetWithAttributes
more like a dict
.
Upvotes: 4
Reputation: 101386
When you say the real data could be much longer, does that prevent you from keeping a list of offset values on hand?
offset_values = [i['offset'] for i in test_data]
bisect.bisect(offset_values, 1900)
Your method seems fine to me though.
Upvotes: 5
Reputation: 15029
The usual pattern here is similar to sorting by an attribute, decorate, operate, and undecorate. So in this case you'd just need to decorate and then call. However you'd want to avoid doing this since decorate would be O(n) whereas you want this to be O(logn). Therefore I'd consider your method best.
Upvotes: 4