Heberto Mayorquin

Reputation: 11091

Fastest inverse operation of index to label in a numpy array: a dictionary (hash) of label to index

I find myself consistently facing this problem in a couple of different scenarios, so I thought I would share it here and see if there is an optimal way to solve it.

Suppose that I have a big array X of whatever kind of objects, and another array y of the same size that holds the label each element of X belongs to. For example:

X = np.array(['object1', 'object2', 'object3', 'object4', 'object5'])
y = np.array([0, 1, 1, 0, 2])

What I want is to build a dictionary / hash that uses the set of labels as keys and, as values, the indexes of all the objects in X that carry each label. In this case the desired output would be:

{0: (array([0, 3]),), 1: (array([1, 2]),), 2: (array([4]),)}

Note that what is actually in X does not matter, but I included it for the sake of completeness.

Now, my naive solution to the problem is to iterate over all the labels and use np.where(labels == label) to build the dictionary. In more detail, I use this function:

def get_key_to_indexes_dic(labels):
    """
    Builds a dictionary whose keys are the labels and whose
    values are the indexes of all the entries with that label.
    """

    # Get the unique labels and initialize the dictionary
    label_set = set(labels)
    key_to_indexes = {}

    # np.where returns a one-element tuple containing the array of indexes
    for label in label_set:
        key_to_indexes[label] = np.where(labels == label)

    return key_to_indexes
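
For reference, running the function on the example data above reproduces the desired dictionary (each value is the one-element tuple that np.where returns):

import numpy as np

X = np.array(['object1', 'object2', 'object3', 'object4', 'object5'])
y = np.array([0, 1, 1, 0, 2])

key_to_indexes = get_key_to_indexes_dic(y)
# key_to_indexes[0] -> (array([0, 3]),)
# key_to_indexes[2] -> (array([4]),)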

So now the core of my question: Is there a way to do better? is there a natural way to solve this using numpy functions? is my approach misguided somehow?


As a side matter of lesser importance: what is the complexity of the solution defined above? I believe it is the following:

O(|set(y)|) * O(np.where) + O(set(y)) = O(m * n + n),  with m the number of distinct labels and n = len(y)

Or in words: the number of labels times the complexity of using np.where on an array of the size of y, plus the complexity of making a set out of that array. Is this correct?

P.S. I could not find a related post with this specific question; if you have suggestions to change the title or anything else, I would be grateful.

Upvotes: 4

Views: 1289

Answers (4)

Devin Cairns

Reputation: 700

I've also struggled to find a "numpythonic" way to solve this type of problem. This is the best approach I've come up with, although it requires a bit more memory:

import numpy

def get_key_to_indexes_dict(labels):
    # Sort the indexes by their label and count how many of each label there are
    indices = numpy.argsort(labels)
    bins = numpy.bincount(labels)
    # Split the sorted indexes at the boundaries between consecutive labels
    indices = numpy.split(indices, numpy.cumsum(bins[bins > 0][:-1]))
    return dict(zip(numpy.unique(labels), indices))
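
As a quick sanity check on the question's example data (note that numpy.bincount assumes the labels are non-negative integers, and the values here come out as plain arrays rather than the one-element tuples np.where returns):

y = numpy.array([0, 1, 1, 0, 2])
key_to_indexes = get_key_to_indexes_dict(y)
# key_to_indexes[0] -> array([0, 3])
# key_to_indexes[2] -> array([4])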

Upvotes: 0

user6758673

Reputation: 589

Assuming that the labels are consecutive integers in [0, m] and taking n = len(labels), the complexity of set(labels) is O(n) and the complexity of the np.where loop is O(m*n). However, the overall complexity is written as O(m*n) rather than O(m*n + n); see "Big O notation" on Wikipedia.


There are two things you can do to improve the performance: 1) use a more efficient algorithm (lower complexity) and 2) replace Python loops with fast array operations.

The other answers currently posted do exactly this, with very sensible code. However, an optimal solution would be both fully vectorized and have O(n) complexity. This can be accomplished using a certain lower-level function from SciPy:

import numpy as np

def sparse_hack(labels):
    from scipy.sparse._sparsetools import coo_tocsr

    labels = labels.ravel()
    n = len(labels)
    nlabels = np.max(labels) + 1

    # Treat each (label, index) pair as a COO entry and let coo_tocsr group
    # the indexes by label, filling `offsets` with the per-label boundaries.
    indices = np.arange(n)
    sorted_indices = np.empty(n, int)
    offsets = np.zeros(nlabels + 1, int)
    dummy = np.zeros(n, int)
    coo_tocsr(nlabels, 1, n, labels, dummy, indices,
                             offsets, dummy, sorted_indices)

    return sorted_indices, offsets

The source for coo_tocsr can be found here. The way I used it, it essentially performs an indirect counting sort. To be honest, this is a rather obscure method and I advise you to use one of the approaches in the other answers.
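
For illustration (this decoding step is not part of the function above): the indexes for label k sit in the slice sorted_indices[offsets[k]:offsets[k+1]], so a dictionary like the one in the question can be rebuilt with:

y = np.array([0, 1, 1, 0, 2])
sorted_indices, offsets = sparse_hack(y)
key_to_indexes = {k: sorted_indices[offsets[k]:offsets[k + 1]]
                  for k in range(len(offsets) - 1)}
# {0: array([0, 3]), 1: array([1, 2]), 2: array([4])}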

Upvotes: 1

Eelco Hoogendoorn

Reputation: 10759

The numpy_indexed package (disclaimer: I am its author) can be used to solve such problems in a fully vectorized manner, with O(n log n) worst-case time complexity:

import numpy_indexed as npi
indices = np.arange(len(labels))
unique_labels, indices_per_label = npi.group_by(labels, indices)

Note that for many common applications of such functionality, such as computing a sum or mean over group labels, it is more efficient not to compute the split list of indices at all, but to use the functions npi provides for that, i.e. npi.group_by(labels).mean(some_corresponding_array), rather than looping over indices_per_label and taking the mean over those indices.
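
If you do want a dictionary shaped like the one in the question, it can be assembled from the group_by output; a minimal sketch on the example data:

import numpy as np
import numpy_indexed as npi

labels = np.array([0, 1, 1, 0, 2])
indices = np.arange(len(labels))
unique_labels, indices_per_label = npi.group_by(labels, indices)
key_to_indexes = dict(zip(unique_labels, indices_per_label))
# key_to_indexes[1] -> array([1, 2])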

Upvotes: 2

chthonicdaemon

Reputation: 19810

You only need to traverse the labels once if you use a dictionary to store the indexes as you go:

from collections import defaultdict

def get_key_to_indexes_ddict(labels):
    # Append each index to the list for its label in a single pass
    indexes = defaultdict(list)
    for index, label in enumerate(labels):
        indexes[label].append(index)
    return indexes
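
A quick check on the example data (this works equally on a plain list or a numpy array); note that the values come out as plain Python lists rather than the tuples of arrays np.where produces:

y = [0, 1, 1, 0, 2]
indexes = get_key_to_indexes_ddict(y)
# indexes[0] -> [0, 3], indexes[1] -> [1, 2], indexes[2] -> [4]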

The scaling of your option seems much like you have analysed; for the function above it's O(N), where N is the size of y, since checking whether a value is in a dictionary is O(1).

The interesting thing is that because np.where traverses the array so much faster, your function wins as long as there are only a small number of labels. Mine seems faster when there are many distinct labels.

Here is how the functions scale:

(scaling graph: runtime of both functions vs. array size, one line per number of distinct labels)

The blue lines are your function, the red lines are mine. The line styles indicate the number of distinct labels: {10: ':', 100: '--', 1000: '-.', 10000: '-'}. You can see that my function is relatively independent of the number of labels, while yours quickly becomes slow when there are many labels. If you have few labels, you're better off with yours.

Upvotes: 2
