jeffalstott

Reputation: 2693

Speeding up all-to-all comparisons with a lookup table on Numpy and/or Pandas

I have two Pandas dataframes that share some common information:

import numpy as np
import pandas as pd
from numpy.random import rand

n_classes = 100
classes = range(n_classes)
activity_data = pd.DataFrame(columns=['Class','Activity'], data=list(zip(classes,rand(n_classes))))

weight_lookuptable = pd.DataFrame(index=classes, columns=classes, data=rand(n_classes,n_classes))
#Important for comprehension: the classes are both the indices and the columns. Every class has a relationship with every other class.

I then want to perform this operation:

q =[sum(activity_data['Activity']*activity_data['Class'].map(weight_lookuptable[c])) for c in activity_data['Class']]

Description: For every class, look up that class's class-to-class weights in the lookup table, multiply them by the activities of the corresponding classes, then sum.

Is there a smarter, faster way to do this? It's pretty fast now, but I'll be doing this millions of times and could really use a reduction of an order of magnitude or two.

Maybe there is something clever to be done by making activity_data['Class'] an index. But obviously the biggest opportunity for gains would be to eliminate the for c in activity_data['Class'] loop. I just don't see how to do it.

Upvotes: 0

Views: 116

Answers (1)

DSM

Reputation: 353379

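IIUC, each element of q is the dot product of the Activity vector with one column of the lookup table, so you could use dot, I think: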

>>> q = [sum(activity_data['Activity']*activity_data['Class'].map(weight_lookuptable[c])) for c in activity_data['Class']]
>>> new_q = activity_data["Activity"].dot(weight_lookuptable)
>>> np.allclose(q, new_q)
True
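The equivalence above depends on the index of activity_data["Activity"] (0 to 99 by default) happening to coincide with the class labels, because Series.dot aligns on index labels. As a minimal sketch, assuming the classes were instead arbitrary labels (the string labels and the name activity_by_class below are hypothetical, not from the question), setting Class as the index keeps the alignment explicit:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_classes = 100
# Hypothetical string labels, to show the case where the default
# integer index would no longer match the class labels.
classes = [f"c{i}" for i in range(n_classes)]

activity_data = pd.DataFrame({"Class": classes, "Activity": rng.random(n_classes)})
weight_lookuptable = pd.DataFrame(
    rng.random((n_classes, n_classes)), index=classes, columns=classes
)

# Index the activities by class so Series.dot can align them
# with the row labels of the lookup table.
activity_by_class = activity_data.set_index("Class")["Activity"]
new_q = activity_by_class.dot(weight_lookuptable)

# Reference result from the original list comprehension.
q = [
    sum(activity_data["Activity"] * activity_data["Class"].map(weight_lookuptable[c]))
    for c in activity_data["Class"]
]
print(np.allclose(q, new_q))
```

The result is a Series labelled by the columns of the lookup table, so each value can be looked up by class name.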

which is much faster for me:

>>> %timeit q = [sum(activity_data['Activity']*activity_data['Class'].map(weight_lookuptable[c])) for c in activity_data['Class']]
10 loops, best of 3: 28.8 ms per loop
>>> %timeit new_q = activity_data["Activity"].dot(weight_lookuptable)
1000 loops, best of 3: 218 µs per loop

You can sometimes squeeze out a bit more performance by dropping to bare numpy (although then you have to be more careful to make sure that your indices are aligned):

>>> %timeit new_q = activity_data["Activity"].values.dot(weight_lookuptable.values)
10000 loops, best of 3: 43.4 µs per loop
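To illustrate the alignment caveat: the bare NumPy version multiplies by position, so it only matches the original result when the rows of activity_data are in the same order as the rows and columns of the lookup table. A minimal sketch, assuming the activity rows arrive in a different order (the shuffle and the names shuffled, order, and aligned are made up for the example), reorders the table once before calling .dot:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_classes = 100
classes = list(range(n_classes))

activity_data = pd.DataFrame({"Class": classes, "Activity": rng.random(n_classes)})
weight_lookuptable = pd.DataFrame(
    rng.random((n_classes, n_classes)), index=classes, columns=classes
)

# Simulate activity rows that are not in lookup-table order.
shuffled = activity_data.sample(frac=1, random_state=0).reset_index(drop=True)

# Reorder rows and columns of the table to match the activity rows.
# If the class order is stable, this can be done once and reused.
order = shuffled["Class"].to_numpy()
aligned = weight_lookuptable.reindex(index=order, columns=order).to_numpy()

# Positional vector-matrix product on plain arrays.
fast_q = shuffled["Activity"].to_numpy().dot(aligned)

# Reference result from the original list comprehension.
slow_q = [
    sum(shuffled["Activity"] * shuffled["Class"].map(weight_lookuptable[c]))
    for c in shuffled["Class"]
]
print(np.allclose(slow_q, fast_q))
```

If the class order never changes between calls, the reindex step can stay outside the hot loop, leaving only the array product to repeat.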

Upvotes: 1
