CopyOfA

Reputation: 851

Fast method to cycle through multiple lists of tuples to find max of each tuple list

I have tens of thousands of lists of tuples, where each tuple within a list consists of an (int, float) pair. I want to be able to cycle through all the lists of tuples to find the (int, float) pair where the float is the maximum value of the float in the list of tuples. Consider several lists of tuples:

[
[(0, 0.3792), (3, 0.5796)],
[(0, 0.9365), (1, 0.0512), (18, 0.0123)],
[(13, 0.8642)],
[(0, 0.6249), (1, 0.01), (2, 0.01), (3, 0.01), (4, 0.01), (5, 0.01)]
]

For each list of tuples, I want to find the pair where the second number is maximized (e.g., for the first list, I want (3, 0.5796); for the fourth list, (0, 0.6249) should be returned). My current approach is to turn each list of tuples into a numpy array and then find argmax and max:

from typing import List, Tuple

import numpy as np

def get_max(doc: List[Tuple[int, float]]) -> Tuple[int, float]:
    topic_prob_array = np.array(doc, dtype=np.dtype('int,float'))
    return topic_prob_array['f0'][np.argmax(topic_prob_array['f1'])], np.max(topic_prob_array['f1'])

I was hoping to turn this into a numpy vectorized function (via vec_func = np.vectorize(get_max, otypes=[int, float])) or a numpy ufunc (via vec_func = np.frompyfunc(get_max, nin=1, nout=1)). I'm not sure if I am formatting the input and output correctly. My reasoning is that I am sending in a single list of tuples and returning a single tuple, hence nin=1, nout=1. However, I have not been able to get a vectorized version of this to run.
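For reference, `np.frompyfunc` can be made to run by putting each list into an object array, so that the wrapped function receives one whole list per element. A minimal sketch using the sample data above; note that it still calls the Python function once per list, so by itself it does not make anything faster:

```python
import numpy as np

def get_max(doc):
    # same structured-array approach as above
    arr = np.array(doc, dtype=np.dtype('int,float'))
    return arr['f0'][np.argmax(arr['f1'])], np.max(arr['f1'])

data = [
    [(0, 0.3792), (3, 0.5796)],
    [(0, 0.9365), (1, 0.0512), (18, 0.0123)],
    [(13, 0.8642)],
    [(0, 0.6249), (1, 0.01), (2, 0.01), (3, 0.01), (4, 0.01), (5, 0.01)],
]

# frompyfunc operates element-wise, so each element must be a whole list;
# an object array keeps the ragged lists from being flattened
docs = np.empty(len(data), dtype=object)
for i, d in enumerate(data):
    docs[i] = d

vec_func = np.frompyfunc(get_max, 1, 1)  # nin=1, nout=1
result = vec_func(docs)  # object array of (int, float) tuples
```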

I also tried a solution without relying on numpy:

from typing import List, Tuple

import numpy as np

def get_max(doc: List[Tuple[int, float]]) -> Tuple[int, float]:
    ids, probabilities = zip(*doc)
    return ids[np.argmax(probabilities)], np.max(probabilities)

Both take about the same amount of time to run: for my dataset of about 80k lists, each implementation takes about 1 minute, 10 seconds. I'd really like to get this down if possible.

Upvotes: 1

Views: 182

Answers (4)

hpaulj

Reputation: 231540

In [462]: alist
Out[462]: 
[[(0, 0.3792), (3, 0.5796)],
 [(0, 0.9365), (1, 0.0512), (18, 0.0123)],
 [(13, 0.8642)],
 [(0, 0.6249), (1, 0.01), (2, 0.01), (3, 0.01), (4, 0.01), (5, 0.01)]]
In [463]: blist = alist*10000    # bigger test list

Playing around with alternatives, I found this "brute force" function is fastest (though not by much):

def get_max3(doc):
    m = doc[0]
    for i in doc[1:]:
        if i[1]>m[1]: m=i
    return m

For the small list, the list comprehension is slightly faster; for the big list, the map version has the edge, but not by much.

In [465]: [get_max3(i) for i in alist]
Out[465]: [(3, 0.5796), (0, 0.9365), (13, 0.8642), (0, 0.6249)]

In [466]: timeit [get_max3(i) for i in alist]
1.9 µs ± 51.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [467]: timeit list(map(get_max3,blist))
15 ms ± 7.77 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Versions using numpy are all much slower; it takes time to convert the list of tuples into a numpy array (even structured array).
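As a rough illustration of that conversion cost (a sketch; absolute timings vary by machine), you can time the pure-Python scan against the per-list structured-array version on the same enlarged input:

```python
import timeit

import numpy as np

alist = [
    [(0, 0.3792), (3, 0.5796)],
    [(0, 0.9365), (1, 0.0512), (18, 0.0123)],
    [(13, 0.8642)],
    [(0, 0.6249), (1, 0.01), (2, 0.01), (3, 0.01), (4, 0.01), (5, 0.01)],
]
blist = alist * 10000  # bigger test list, as above

def get_max3(doc):
    # pure-Python linear scan, no conversion
    m = doc[0]
    for i in doc[1:]:
        if i[1] > m[1]:
            m = i
    return m

def get_max_np(doc):
    # pays the list-of-tuples -> structured-array conversion on every call
    arr = np.array(doc, dtype=np.dtype('int,float'))
    return arr['f0'][np.argmax(arr['f1'])], np.max(arr['f1'])

t_py = timeit.timeit(lambda: list(map(get_max3, blist)), number=3)
t_np = timeit.timeit(lambda: list(map(get_max_np, blist)), number=3)
print(f'pure Python: {t_py:.3f}s, numpy per-list: {t_np:.3f}s')
```

Both functions return the same answers; only the per-call conversion differs.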

Upvotes: 1

gold_cy

Reputation: 14236

Do you need to use numpy for this? We can take a functional approach and map the max function with a custom key across the whole data set.

from functools import partial
from operator import itemgetter

snd = itemgetter(1)
p = partial(max, key=snd)
list(map(p, data))  # data: the list of lists of tuples from the question
>>> [(3, 0.5796), (0, 0.9365), (13, 0.8642), (0, 0.6249)]

Then a quick timing across 80K random tuples from your original dataset.

from random import choice

result = []
for _ in range(80_000):
    result.append(choice(data))

%timeit list(map(p, result))
42.2 ms ± 686 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Upvotes: 1

ShadowRanger

Reputation: 155516

The optimized non-numpy solution to this is:

from operator import itemgetter

get1 = itemgetter(1)

all_lists = [...]  # Whatever your actual list of list of tuples comes from

all_maxes = [max(lst, key=get1) for lst in all_lists]

numpy isn't likely to gain you much, since the work done is relatively cheap, and if you're only converting to numpy arrays for a single operation, the scope of benefit is smaller.
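Applied to the sample data from the question, the one-liner above looks like this (a quick usage sketch):

```python
from operator import itemgetter

get1 = itemgetter(1)

all_lists = [
    [(0, 0.3792), (3, 0.5796)],
    [(0, 0.9365), (1, 0.0512), (18, 0.0123)],
    [(13, 0.8642)],
    [(0, 0.6249), (1, 0.01), (2, 0.01), (3, 0.01), (4, 0.01), (5, 0.01)],
]

# max with key=get1 compares tuples by their second element only
all_maxes = [max(lst, key=get1) for lst in all_lists]
print(all_maxes)  # [(3, 0.5796), (0, 0.9365), (13, 0.8642), (0, 0.6249)]
```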

Upvotes: 3

irahorecka

Reputation: 1807

As @gold_cy mentioned, I'm not sure whether you're looking for a numpy answer. A non-numpy approach could be:

list_tuple = [
    [(0, 0.3792), (3, 0.5796)],
    [(0, 0.9365), (1, 0.0512), (18, 0.0123)],
    [(13, 0.8642)],
    [(0, 0.6249), (1, 0.01), (2, 0.01), (3, 0.01), (4, 0.01), (5, 0.01)]
]

[sorted(tup, key=lambda x: x[1], reverse=True).pop(0) for tup in list_tuple]
>>> [(3, 0.5796), (0, 0.9365), (13, 0.8642), (0, 0.6249)]

Upvotes: 1
