Reputation: 851
I have tens of thousands of lists of tuples, where each tuple within a list consists of an (int, float)
pair. I want to be able to cycle through all the lists of tuples to find the (int, float)
pair where the float
is the maximum value of the float in the list of tuples. Consider several lists of tuples:
[
[(0, 0.3792), (3, 0.5796)],
[0, 0.9365), (1, 0.0512), (18, 0.0123),
[(13, 0.8642)],
[(0, 0.6249), (1, 0.01), (2, 0.01), (3, 0.01), (4, 0.01), (5, 0.01)]
]
For each list of tuples, I want to find the pair where the second number is maximized (e.g., for the first list, I want (3, 0.5796)
; for the 4th item, (0, 0.6249)
should be returned). My current approach is to turn the tuples into numpy arrays and then find argmax and max:
def get_max(doc: List[Tuple[int, float]]) -> Tuple[int, float]:
topic_prob_array = np.array(doc, dtype=np.dtype('int,float'))
return topic_prob_array['f0'][np.argmax(topic_prob_array['f1'])], np.max(topic_prob_array['f1'])
I was hoping to turn this into a numpy vectorized function (via vec_func = np.vectorized(get_max, otypes=[int,float])
or numpy ufunc (via vec_func = np.fromfunc(get_max, nin=1, nout=1)
. I not sure if I am formatting the input and output correctly. My reasoning is that I am sending in a single list of tuples and return a single tuple, hence nin=1, nout=1
. However, I have not been able to successfully get a vectorized version of this to run.
I also tried a solution without relying on numpy
:
def get_max(doc: List[Tuple[int, float]]) -> Tuple[int, float]:
ids, probabilities = zip(*doc)
return ids[np.argmax(probabilities)], np.max(probabilities)
Both take about the same amount of time to run. For my list of about 80k, this takes about 1 minute, 10 seconds for both implementations. I'd really like to get this down if it's possible.
Upvotes: 1
Views: 182
Reputation: 231540
In [462]: alist
Out[462]:
[[(0, 0.3792), (3, 0.5796)],
[(0, 0.9365), (1, 0.0512), (18, 0.0123)],
[(13, 0.8642)],
[(0, 0.6249), (1, 0.01), (2, 0.01), (3, 0.01), (4, 0.01), (5, 0.01)]]
In [463]: blist = alist*10000 # bigger test list
Playing around with alternatives, I found this "brute force" function is fastest (though not by much):
def get_max3(doc):
m = doc[0]
for i in doc[1:]:
if i[1]>m[1]: m=i
return m
For the small list, the list comprehension is slightly faster, for the big list, the map version has the edge - but not by much.
In [465]: [get_max3(i) for i in alist]
Out[465]: [(3, 0.5796), (0, 0.9365), (13, 0.8642), (0, 0.6249)]
In [466]: timeit [get_max3(i) for i in alist]
1.9 µs ± 51.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [467]: timeit list(map(get_max3,blist))
15 ms ± 7.77 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Versions using numpy
are all much slower; it takes time to convert the list of tuples into a numpy array (even structured array).
Upvotes: 1
Reputation: 14236
Do you need to use numpy
for this? We can take a functional approach and map
the max
function with a custom key
across the whole data set.
from functools import partial
from operator import itemgetter
snd = itemgetter(1)
p = partial(max, key=snd)
list(map(p, data))
>>> [(3, 0.5796), (0, 0.9365), (13, 0.8642), (0, 0.6249)]
Then a quick timing across 80K random tuples from your original dataset.
from random import choice
result = []
for _ in range(80_000):
result.append(choice(data))
%timeit list(map(p, result))
42.2 ms ± 686 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 1
Reputation: 155516
The optimized non-numpy
solution to this is:
from operator import itemgetter
get1 = itemgetter(1)
all_lists = [...] # Whatever your actual list of list of tuples comes from
all_maxes = [max(lst, key=get1) for lst in all_lists]
numpy
isn't likely to gain you much, since the work done is relatively cheap, and if you're only converting to numpy
arrays for a single operation, the scope of benefit is smaller.
Upvotes: 3
Reputation: 1807
Like @gold_cy mentioned, I'm not sure if you're looking for a numpy
answer. A non-numpy
answer could be:
list_tuple = [
[(0, 0.3792), (3, 0.5796)],
[(0, 0.9365), (1, 0.0512), (18, 0.0123)],
[(13, 0.8642)],
[(0, 0.6249), (1, 0.01), (2, 0.01), (3, 0.01), (4, 0.01), (5, 0.01)]
]
[sorted(tup, key=lambda x: x[1], reverse=True).pop(0) for tup in list_tuple]
>>> [(3, 0.5796), (0, 0.9365), (13, 0.8642), (0, 0.6249)]
Upvotes: 1