Gabriel

Reputation: 42329

Speed up search of array element in second array

I have a pretty simple operation involving two not so large arrays:

  1. For every element at position i in the first (larger) array,
  2. check whether it exists in the second (smaller) array.
  3. If it does, find its index j in the second array.
  4. Take the float at position i of a third array (same length as the first) and store it at position j of a fourth array (same length as the second).

The for block below works, but becomes very slow even for moderately sized arrays (>10000 elements).

Can this implementation be made faster?

import numpy as np
import random

##############################################
# Generate some random data.
# 'Nb' is always smaller than 'Na'.
Na, Nb = 50000, 40000

# List of IDs (could be any string, I use integers here for simplicity)
ids_a = random.sample(range(1, Na * 10), Na)
ids_a = [str(_) for _ in ids_a]
random.shuffle(ids_a)
# Some floats associated to these IDs
vals_in_a = np.random.uniform(0., 1., Na)

# Smaller list of repeated IDs from 'ids_a'
ids_b = random.sample(ids_a, Nb)
# Array to be filled
vals_in_b = np.zeros(Nb)
##############################################

# This block needs to be *a lot* more efficient
#
# For each string in 'ids_a'
for i, id_a in enumerate(ids_a):
    # if it exists in 'ids_b'
    if id_a in ids_b:
        # find where in 'ids_b' this element is located
        j = ids_b.index(id_a)
        # store in that position the value taken from 'vals_in_a'
        vals_in_b[j] = vals_in_a[i]

Upvotes: 0

Views: 264

Answers (2)

Paul Panzer

Reputation: 53029

In defense of my approach, here is the authoritative implementation:

import itertools as it

def pp():
    la,lb = len(ids_a),len(ids_b)
    ids = np.fromiter(it.chain(ids_a,ids_b),'<S6',la+lb)
    unq,inv = np.unique(ids,return_inverse=True)
    vals = np.empty(la,vals_in_a.dtype)
    vals[inv[:la]] = vals_in_a
    return vals[inv[la:]]

(juanpa()==pp()).all()
# True

from timeit import timeit

timeit(juanpa, number=100)
# 3.1373191522434354
timeit(pp, number=100)
# 2.5256317732855678
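
To see why the inverse index from np.unique does the matching, here is a tiny worked illustration of the same trick (my own toy data, not from the benchmark above): the inverse places both lists in one shared index space over the uniques.

```python
import numpy as np

a = ['x', 'y', 'z']          # stand-in for ids_a
b = ['z', 'x']               # stand-in for ids_b
unq, inv = np.unique(np.array(a + b), return_inverse=True)
# inv[:len(a)] are the positions of a's elements among the uniques,
# inv[len(a):] are the positions of b's elements in the same index space.
vals = np.empty(len(a))
vals[inv[:len(a)]] = [1.0, 2.0, 3.0]   # floats associated with ids_a
print(vals[inv[len(a):]])              # floats for ids_b: [3. 1.]
```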

That said, @juanpa.arrivillaga's suggestion can also be implemented better:

import operator as op

def ja():
    return op.itemgetter(*ids_b)(dict(zip(ids_a,vals_in_a)))

(ja()==pp()).all()
# True
timeit(ja,number=100)
# 2.015202699229121
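
One caveat worth noting (my own observation, not from the answer): op.itemgetter raises KeyError if any id in ids_b is missing from ids_a. That cannot happen here because ids_b is sampled from ids_a, but if missing ids were possible, a dict.get variant with a default would be a safer sketch:

```python
def ja_safe(ids_a, vals_in_a, ids_b, default=0.0):
    # Same dict-based lookup, but missing ids map to 'default'
    # instead of raising KeyError.
    lookup = dict(zip(ids_a, vals_in_a))
    return [lookup.get(i, default) for i in ids_b]
```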

Upvotes: 1

Gabriel

Reputation: 42329

I tried the approaches by juanpa.arrivillaga and Paul Panzer. The first is by far the fastest, and also the simplest. The second is faster than my original approach, but considerably slower than the first. It also has the drawback that the line vals[inv_a] = vals_in_a stores floats into a U5 (unicode string) array, converting them into strings. They can be converted back to floats at the end, but digits are lost in the truncation (unless I'm missing something obvious, of course).
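
A minimal demonstration of that drawback (my own toy example): assigning a float into a fixed-width unicode array stores it as a truncated string, so precision is lost on the round trip.

```python
import numpy as np

unq = np.array(['123', '45678'])  # dtype '<U5', like the array from np.unique
unq[0] = 0.123456789              # stored via str(), cut to 5 characters
print(unq[0])                     # '0.123' -- the remaining digits are gone
print(float(unq[0]))              # 0.123, not 0.123456789
```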

Here are the implementations:

def juanpa():
    dict_ids_b = {_: i for i, _ in enumerate(ids_b)}
    for i, id_a in enumerate(ids_a):
        try:
            vals_in_b[dict_ids_b[id_a]] = vals_in_a[i]
        except KeyError:
            pass

    return vals_in_b


def Paul():
    # 1) concatenate ids_a and ids_b
    ids_ab = ids_a + ids_b
    # 2) apply np.unique with keyword return_inverse=True
    vals, idxs = np.unique(ids_ab, return_inverse=True)
    # 3) split the inverse into inv_a and inv_b
    inv_a, inv_b = idxs[:len(ids_a)], idxs[len(ids_a):]
    # 4) map the values to match the order of uniques: vals[inv_a] = vals_in_a
    vals[inv_a] = vals_in_a
    # 5) use inv_b to pick the correct values: result = vals[inv_b]
    vals_in_b = vals[inv_b].astype(float)

    return vals_in_b

Upvotes: 0
