Reputation: 42329
I have a pretty simple operation involving two not so large arrays:
i
j
i
, in the position j
of a fourth array (same length as second array)The for
block below works, but gets very slow for not so large arrays (>10000).
Can this implementation be made faster?
import numpy as np
import random
##############################################
# Generate some random data.
#'Nb' is always smaller then 'Na
Na, Nb = 50000, 40000
# List of IDs (could be any string, I use integers here for simplicity)
ids_a = random.sample(range(1, Na * 10), Na)
ids_a = [str(_) for _ in ids_a]
random.shuffle(ids_a)
# Some floats associated to these IDs
vals_in_a = np.random.uniform(0., 1., Na)
# Smaller list of repeated IDs from 'ids_a'
ids_b = random.sample(ids_a, Nb)
# Array to be filled
vals_in_b = np.zeros(Nb)
##############################################
# This block needs to be *a lot* more efficient
#
# For each string in 'ids_a'
for i, id_a in enumerate(ids_a):
# if it exists in 'ids_b'
if id_a in ids_b:
# find where in 'ids_b' this element is located
j = ids_b.index(id_a)
# store in that position the value taken from 'ids_a'
vals_in_b[j] = vals_in_a[i]
Upvotes: 0
Views: 264
Reputation: 53029
In defense of my approach, here is the authoritative implementation:
import itertools as it
def pp():
la,lb = len(ids_a),len(ids_b)
ids = np.fromiter(it.chain(ids_a,ids_b),'<S6',la+lb)
unq,inv = np.unique(ids,return_inverse=True)
vals = np.empty(la,vals_in_a.dtype)
vals[inv[:la]] = vals_in_a
return vals[inv[la:]]
(juanpa()==pp()).all()
# True
timeit(juanpa,number=100)
# 3.1373191522434354
timeit(pp,number=100)
# 2.5256317732855678
That said, @juanpa.arrivillaga's suggestion can also be implemented better:
import operator as op
def ja():
return op.itemgetter(*ids_b)(dict(zip(ids_a,vals_in_a)))
(ja()==pp()).all()
# True
timeit(ja,number=100)
# 2.015202699229121
Upvotes: 1
Reputation: 42329
I tried the approaches by juanpa.arrivillaga and Paul Panzer. The first one is the fastest by far. It is also the simplest. The second one is faster than my original approach, but considerably slower than the first one. It also has the drawback that this line vals[inv_a] = vals_in_a
stores floats into a U5
array, thus converting them into strings. It can be converted back to floats at the end, but I lose digits (unless I'm missing something obvious of course.
Here are the implementations:
def juanpa():
dict_ids_b = {_: i for i, _ in enumerate(ids_b)}
for i, id_a in enumerate(ids_a):
try:
vals_in_b[dict_ids_b[id_a]] = vals_in_a[i]
except KeyError:
pass
return vals_in_b
def Paul():
# 1) concatenate ids_a and ids_b
ids_ab = ids_a + ids_b
# 2) apply np.unique with keyword return_inverse=True
vals, idxs = np.unique(ids_ab, return_inverse=True)
# 3) split the inverse into inv_a and inv_b
inv_a, inv_b = idxs[:len(ids_a)], idxs[len(ids_a):]
# 4) map the values to match the order of uniques: vals[inv_a] = vals_in_a
vals[inv_a] = vals_in_a
# 5) use inv_b to pick the correct values: result = vals[inv_b]
vals_in_b = vals[inv_b].astype(float)
return vals_in_b
Upvotes: 0