Laurent Claessens
Laurent Claessens

Reputation: 695

Length of the intersections between a list an list of list

Note : almost duplicate of Numpy vectorization: Find intersection between list and list of lists

Differences :

x = [500 numbers between 1 and N]
y = [[1, 2, 3], [4, 5, 6, 7], [8, 9], [10, 11, 12], etc. up to N]

Here are some assumptions:

My purpose is to know, among the sublists of y, which have at least 10 intersections with x.

I can obviously make a loop :

def find_best(x, y):
    result = []

    for index, sublist in enumerate(y):
        intersection = set(x).intersection(set(sublist))
        if len(intersection) > 2:  # in real live: > 10
            result.append(index)

    return(result)


x = [1, 2, 3, 4, 5, 6]
y = [[1, 2, 3], [4],  [5, 6], [7], [8, 9, 10, 11]]

res = find_best(x, y)
print(res)   # [0, 2]

Here the result is [0,2] because the first and third sublist of y have 2 elements in intersection with x.

An other method should to parse only once y and count the intesections :

def find_intersec2(x, y):
    n_sublists = len(y)
    res = {num: 0 for num in range(0, n_sublists + 1)}
    for list_no, sublist in enumerate(y):
        for num in sublist:
            if num in x:
                x.remove(num)
                res[list_no] += 1
    return [n for n in range(n_sublists + 1) if res[n] >= 2]

This second method uses more the hypothesis.

Questions :

Upvotes: 1

Views: 330

Answers (3)

Paddy3118
Paddy3118

Reputation: 4772

This uses y to create a linear array mapping every int to the (1 plus), the index of the range or subgroup the int is in; called x2range_counter.

x2range_counter uses a 32 bit array.array type to save memory and can be cached and used for calculations of all x on the same y.

calculating the hits in each range for a particular x is then just indirected array incrementing of a count'er in function count_ranges`.

y = [[1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11, 12]]
x = [5, 3, 1, 11, 8, 10]

range_counter_max = len(y)
extent = y[-1][-1] + 1  # min in y must be 1 not 0 remember.
x2range_counter = array.array('L', [0] * extent)  # efficient 32 bit array storage

# Map any int in any x to appropriate ranges counter.
for range_counter_index, rng in enumerate(y, start=1):
    for n in rng:
        x2range_counter[n] = range_counter_index
print(x2range_counter)  # array('L', [0, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4])

# x2range_counter can be saved for this y and any x on this y.

def count_ranges(x: List[int]) -> List[int]:
    "Number of x-hits on each y subgroup in order"
    # Note: count[0] initially catches errors. count[1..] counts x's in y ranges [0..]
    count = array.array('L', [0] * (range_counter_max + 1))
    for xx in x:
        count[x2range_counter[xx]] += 1
    assert count[0] == 0, "x values must all exist in a y range and y must have all int in its range."

    return count[1:] 

print(count_ranges(x))  # array('L', [1, 2, 1, 2])

I created a class for this, with extra functionality such as returning the ranges rather than the indices; all ranges hit >=M times; (range, hit-count) tuples sorted most hit first.

Range calculations for different x are proportional to x and are simple array lookups rather than any hashing of dicts.

What do you think?

Upvotes: 0

Winson Tanputraman
Winson Tanputraman

Reputation: 3632

Turn y into 2 dicts.

index = { # index to count map
    0 : 0,
    1 : 0,
    2 : 0,
    3 : 0,
    4 : 0
}

y = { # elem to index map
    1: 0,
    2: 0,
    3: 0,
    4: 1,
    5: 2,
    6: 2,
    7: 3,
    8 : 4,
    9 : 4,
    10 : 4,
    11 : 4
}

Since you know y in advance, I don't count the above operations into the time complexity. Then, to count the intersection:

x = [1, 2, 3, 4, 5, 6]
for e in x: index[y[e]] += 1

Since you mentioned x is small, I try to make the time complexity depends only on the size of x (in this case O(n)).

Finally, the answer is the list of keys in index dict where the value is >= 2 (or 10 in real case).

answer = [i for i in index if index[i] >= 2]

Upvotes: 0

Jérôme Richard
Jérôme Richard

Reputation: 50518

Since y contains disjoint ranges and the union of them is also a range, a very fast solution is to first perform a binary search on y and then count the resulting indices and only return the ones that appear at least 10 times. The complexity of this algorithm is O(Nx log Ny) with Nx and Ny the number of items in respectively x and y. This algorithm is nearly optimal (since x needs to be read entirely).


Actual implementation

First of all, you need to transform your current y to a Numpy array containing the beginning value of all ranges (in an increasing order) with N as the last value (assuming N is excluded for the ranges of y, or N+1 otherwise). This part can be assumed as free since y can be computed at compile time in your case. Here is an example:

import numpy as np
y = np.array([1, 4, 8, 10, 13, ..., N])

Then, you need to perform the binary search and check that the values fits in the range of y:

indices = np.searchsorted(y, x, 'right')

# The `0 < indices < len(y)` check should not be needed regarding the input.
# If so, you can use only `indices -= 1`.
indices = indices[(0 < indices) & (indices < len(y))] - 1

Then you need to count the indices and filter the ones with at least :

uniqueIndices, counts = np.unique(indices, return_counts=True)
result = uniqueIndices[counts >= 10]

Here is an example based on your:

x = np.array([1, 2, 3, 4, 5, 6])

# [[1, 2, 3], [4],  [5, 6], [7], [8, 9, 10, 11]]
y = np.array([1, 4, 5, 7, 8, 12])

# Actual simplified version of the above algorithm
indices = np.searchsorted(y, x, 'right') - 1
uniqueIndices, counts = np.unique(indices, return_counts=True)
result = uniqueIndices[counts >= 2]

# [0, 2]
print(result.tolist())

It runs in less than 0.1 ms on my machine on a random input based on your input constraints.

Upvotes: 1

Related Questions