bpeter
bpeter

Reputation: 255

Data structure for custom indexing

I am looking to write a data structure to represent some genetic data. This data can be represented as a list of size n, where each entry also has a "genetic position" which is a real number between 0 and 1. To make nomenclature clear, I'll call the position in the list id and the genetic position gpos. The way I implemented this is as a class with

class Coords(object):

    def __init__(self, *args, **kwargs):
        self.f = list(*args, **kwargs)
        self.r = dict()
        for i,e in enumerate(self.f):
            self.r[e] = i

    def __setitem__(self,x,y):
        self.f.__setitem__(x,y)
        self.r.__setitem__(y,x)

    def __getitem__(self,x):
        return self.f.__getitem__(x)

    def __len__(self):
        return self.f.__len__()

now, I have two issues with this. The first one is that the indeces of self.r are floats, which is obviously a bad idea. I was thinking of converting them to strings (with a fixed number of digits), but is there a better idea? The other issue I have is that I want to implement accessing entries via gpos, so if I, for example, would like to access everything between gpos 0.2 and 0.4, I would like to be able to do that using

import numpy as np
Coords(np.arange(1,0,-.1))
c.r[0.2:0.4]

is there an easy way to define that? I was thinking of finding the correct id of the starting and ending positions using binary search and then access self.f using these ids, but is there a way to implement above syntax?

Upvotes: 4

Views: 5651

Answers (2)

lmjohns3
lmjohns3

Reputation: 7592

If you know that your gpos values will (or can) always be stored in sorted order, then I'd definitely recommend using binary search for this task. You can take advantage of the array syntax and numpy's builtin implementation with searchsorted:

>>> gpos_vals = np.linspace(0, 1, 11)
>>> gpos_vals
array([ 0. ,  0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ])
>>> lo, hi = gpos_vals.searchsorted([0.22, 0.52])
>>> lo, hi
(3, 6)
>>> gpos_vals[lo:hi]
array([ 0.3,  0.4,  0.5])

I think this nicely avoids the issues that you pointed out regarding using floats for dictionary keys, which could be problematic.

You could also combine this answer with Jaime's and make a class that looks for a slice inside a custom __getitem__, and then passes the slice parameters to searchsorted as in my snippet:

class GeneticPositions(object):
    def __init__(self, gpos_values):
        self.gpos_values = np.asarray(gpos_values)

    def __getitem__(self, x):
        if isinstance(x, slice):
            lo, hi = self.gpos_values.searchsorted(
                [x.start or 0, x.stop or 1])
            return self.gpos_values[lo:hi]
        return self.gpos_values[x]

Upvotes: 3

Jaime
Jaime

Reputation: 67457

When you index an object with a slice, Python creates a slice object with the inputs you provide. For example, if you do c[0.2:0.4], then the argument passed to c.__getitem__ will be slice(0.2, 0.4). So you could have something like this code in your __getitem__ method:

def __getitem__(self, x):
    if isinstance(x, slice):
        start = x.start
        stop = x.stop
        step = x.step
        # Do whatever you want to do to define your return
    ...

If you want to use this fancy indexing not on the Coords object, but in the self.r dictionary, I think the easiest would be to create a FancyIndexDict that is a subclass of dict, modify its __getitem__ method, and then have self.r be a FancyIndexDict, not a dict.

Upvotes: 5

Related Questions