pythonsortingnumpyoptimizationset-intersection

Reputation: 1169

Indices that intersect and sort two numpy arrays

I have two numpy arrays of integers, both of length several hundred million. Within each array values are unique, and each is initially unsorted.

I would like the indices to each that yield their sorted intersection. For example:

x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])

Then the sorted intersection of these is [4, 5, 11], and so we want the indices that turn each of x and y into that array, so we want it to return:

mx = np.array([0, 3, 6])
my = np.array([2, 1, 4])

since then x[mx] == y[my] == np.intersect1d(x, y)

The only solution we have so far involves three different argsorts, so it seems that is unlikely to be optimal.

Each value represents a galaxy, in case that makes the problem more fun.

Upvotes: 8

Answers (4)

ali_m

Reputation: 74152

For a pure numpy solution you could do something like this:

Use np.unique to get the unique values and corresponding indices in x and y separately:

# sorted unique values in x and y and the indices corresponding to their first
# occurrences, such that u_x == x[u_idx_x]
u_x, u_idx_x = np.unique(x, return_index=True)
u_y, u_idx_y = np.unique(y, return_index=True)

Find the intersection of the unique values using np.intersect1d:

# we can assume_unique, which can be faster for large arrays
i_xy = np.intersect1d(u_x, u_y, assume_unique=True)

Finally, use np.in1d to select only the indices that correspond to unique values in x or y that also happen to be in the intersection of x and y:

# it is also safe to assume_unique here
i_idx_x = u_idx_x[np.in1d(u_x, i_xy, assume_unique=True)]
i_idx_y = u_idx_y[np.in1d(u_y, i_xy, assume_unique=True)]

To pull all that together into a single function:

def intersect_indices(x, y):
    u_x, u_idx_x = np.unique(x, return_index=True)
    u_y, u_idx_y = np.unique(y, return_index=True)
    i_xy = np.intersect1d(u_x, u_y, assume_unique=True)
    i_idx_x = u_idx_x[np.in1d(u_x, i_xy, assume_unique=True)]
    i_idx_y = u_idx_y[np.in1d(u_y, i_xy, assume_unique=True)]
    return i_idx_x, i_idx_y

For example:

x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])

i_idx_x, i_idx_y = intersect_indices(x, y)

print(i_idx_x, i_idx_y)
# (array([0, 3, 6]), array([2, 1, 4]))

Speed test:

In [1]: k = 1000000

In [2]: %%timeit x, y = np.random.randint(k, size=(2, k))
intersect_indices(x, y)
....: 
1 loops, best of 3: 597 ms per loop

Update:

I initially missed the fact that in your case both x and y contain only unique values. Taking that into account, it's possible to do slightly better by using an indirect sort:

def intersect_indices_unique(x, y):
    u_idx_x = np.argsort(x)
    u_idx_y = np.argsort(y)
    i_xy = np.intersect1d(x, y, assume_unique=True)
    i_idx_x = u_idx_x[x[u_idx_x].searchsorted(i_xy)]
    i_idx_y = u_idx_y[y[u_idx_y].searchsorted(i_xy)]
    return i_idx_x, i_idx_y

Here's a more realistic test case, where x and y both contain unique (but partially overlapping) values:

In [1]: n, k = 10000000, 1000000

In [2]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2))
intersect_indices(x, y)
....: 
1 loops, best of 3: 593 ms per loop

In [3]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2))
intersect_indices_unique(x, y)
....: 
1 loops, best of 3: 453 ms per loop

@Divakar's solution is very similar in terms of performance:

In [4]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2))
searchsorted_based(x, y)
....: 
1 loops, best of 3: 472 ms per loop

Upvotes: 3

Divakar

Reputation: 221524

You could also use np.searchsorted, like so -

def searchsorted_based(x,y):

    # Get argsort for both x and y
    xsort_idx = x.argsort()
    ysort_idx = y.argsort()

    # Sort x and y and store them
    X = x[xsort_idx]
    Y = y[ysort_idx]

    # Find positions of Y in X and the matches by the positions that 
    # shift between 'left' and 'right' based searches. 
    # Use the matches posotions to get corresponding argsort for X.
    x1 = np.searchsorted(X,Y,'left') 
    x2 = np.searchsorted(X,Y,'right') 
    out1 = xsort_idx[x1[x2 != x1]]

    # Repeat for X in Y findings
    y1 = np.searchsorted(Y,X,'left') 
    y2 = np.searchsorted(Y,X,'right')
    out2 = ysort_idx[y1[y2 != y1]]

    return out1, out2

Sample run -

In [100]: x = np.array([4, 1, 10, 5, 8, 13, 11])
     ...: y = np.array([20, 5, 4, 9, 11, 7, 25])
     ...: 

In [101]: searchsorted_based(x,y)
Out[101]: (array([0, 3, 6]), array([2, 1, 4]))

Upvotes: 3

Rory Yorke

Reputation: 2236

Here's an option based on intersect1d's implementation, which is fairly straightforward. It requires one call to argsort.

The admittedly simplistic test passes.

import numpy as np


def my_intersect(x, y):
    """my_intersect(x, y) -> xm, ym
    x, y: 1-d arrays of unique values
    xm, ym: indices into x and y giving sorted intersection
    """
    # basic idea taken from numpy.lib.arraysetops.intersect1d
    aux = np.concatenate((x, y))
    sidx = aux.argsort()
    # Note: intersect1d uses aux[:-1][aux[1:]==aux[:-1]] here - I don't know why the first [:-1] is necessary
    inidx = aux[sidx[1:]] == aux[sidx[:-1]]

    # quicksort is not stable, so must do some work to extract indices
    # (if stable, sidx[inidx.nonzero()]  would be for x)
    # interlace the two sets of indices, and check against lengths
    xym = np.vstack((sidx[inidx.nonzero()],
                     sidx[1:][inidx.nonzero()])).T.flatten()

    xm = xym[xym < len(x)]
    ym = xym[xym >= len(x)] - len(x)

    return xm, ym


def check_my_intersect(x, y):
    mx, my = my_intersect(x, y)
    assert (x[mx] == np.intersect1d(x, y)).all()

    # not really necessary: np.intersect1d returns a sorted list
    assert (x[mx] == sorted(x[mx])).all()
    assert (x[mx] == y[my]).all()


def random_unique_unsorted(n):
    while True:
        x = np.unique(np.random.randint(2*n, size=n))
        if len(x):
            break
    np.random.shuffle(x)
    return x


x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])

check_my_intersect(x, y)


for i in range(20):
    x = random_unique_unsorted(100+i)
    y = random_unique_unsorted(200+i)
    check_my_intersect(x, y)

Edit: "Note" comment was confusing (Used ... as speech ellipsis, forgot it was a Python operator too).

Upvotes: 3

Mike Müller

Reputation: 85442

Maybe a pure Python solutions using a dict works for you:

def indices_from_values(a, intersect):
    idx = {value: index for index, value in enumerate(a)}
    return np.array([idx[x] for x in intersect])

intersect = np.intersect1d(x, y)
mx = indices_from_values(x, intersect)
my = indices_from_values(y, intersect)
np.allclose(x[mx], y[my]) and np.allclose(x[mx], np.intersect1d(x, y))

Upvotes: 0

Indices that intersect and sort two numpy arrays

Answers (4)

Update:

Related Questions