Axel Puig
Axel Puig

Reputation: 1334

Vectorize a function for lists of lists

I need to use an efficient python data structure with these characteristics:

Here is an example with a deepness of 3

[[[1,2] , [3,4,5]],
 [[6,7,8] , [9] , [10] , [11,12]],
 [[13] , [14,15] , [16,17,18]]

Most of the time the structure would contain numpy arrays or numbers, but it could also be another object like a dict. However, always the same datatype in a given structure.

My main problem is that I need to apply functions to such structures (like a "vectorized" function). I want my functions to take several structures with the same shape as arguments, and return other ones.

What would be the most efficient in your opinion:

I am especially looking for RAM efficiency.

I hope I've made my problem clear, thanks for your help.

Upvotes: 0

Views: 201

Answers (2)

Alain T.
Alain T.

Reputation: 42129

If you want to get truly vectorized processing, you will need to use a library such as numpy. But these will likely limit the datatypes that you can support in order to allow processing by GPUs.

In either case, you could use a dictionary to flatten the structure and facilitate batch processing of elements of the structure. This would be a dictionary with tuples as keys where each entry in the tuple represents the index of value at that level:

For example:

[ 
  [ [1,2] ,   [3,4,5] ],
  [ [6,7,8] , [9] ,     [10] ,     [11,12] ],
  [ [13] ,    [14,15] , [16,17,18] ]
]

could be represented in such a dictionary as:

{ 
  (0,0,0) : 1,
  (0,0,1) : 2,
  (0,1,0) : 3,
  (0,1,1) : 4,
  (0,1,2) : 5,
  (1,0,0) : 6,
  (1,0,1) : 7,
  (1,0,2) : 8,
  (1,1,0) : 9,
  (1,2,0) : 10,
  (1,3,0) : 11,
  (1,3,1) : 12,
  (2,0,0) : 13,
  (2,1,0) : 14,
  (2,1,1) : 15,
  (2,1,0) : 16,
  (2,1,1) : 17,
  (2,1,2) : 18
}

This could also be represented in numpy using two arrays (one for level indices and one for data)

Processing between structures of this type would provide fast traversal of leaf values in the tree structure while maintaining the relationships between branches.

for example:

# sum of values under second branch:
result = sum( value for level,value in data.items() if level[0] == 1 )

# or using numpy:
result = np.sum(data[levels[:,0]==1]) 

# adding two structures:
result = { k:data1.get(k,0)+data2.get(k,0) for k in set((*data1,*data2)) }

# or using numpy (assuming same levels in both structures)
resultLevels, resultData = levels1,data1+data2

# numpy adding structures with different levels is a bit more involved
# but still feasible.

Upvotes: 1

Axel Puig
Axel Puig

Reputation: 1334

Thanks Alain T.

I have used your idea and written this class to handle my data. This way I can access my elements with slice almost like with a numpy array, and I can use the data parameter (flattened data) with numpy vectorized functions:

class DataStructure():
    __slots__ = ["data", "positions"]

    def __init__(self, data, place):
        self.data = np.array(data)
        self.positions = np.array(place)

    def __getitem__(self, item):
        item = (item,) if not isinstance(item, tuple) else item
        mask = np.full((len(self.positions),), True)
        for i, selection in enumerate(item):
            if not isinstance(selection, slice):
                mask &= self.positions[:, i] == selection
            else:
                start, stop, step = selection.indices(len(self.positions))
                mask &= np.isin(self.positions[:,i], range(start,stop,step))
        return self.data[mask]

PS: Don't hesitate to tell me if it can be optimized, especially with my use of slices.

Upvotes: 0

Related Questions