Reputation: 33207
I have a huge list of tuples :
ijs = [(0,1),(0,2),(0,3), (3,2)...]
For a given value v
, I want to get only the (i,j)
pairs that have either i=v
or j=v
(from all possible (i,j) pairs stored in ijs
).
E.g. for v=0
and given ijs = [(0,1),(0,2),(0,3), (3,2)]
, then I should get back only_current = [(0,1),(0,2),(0,3)]
Example:
Please ignore the first 3 lines where I built a list ijs
having tuples inside.
import numpy as np
# IGNORE THIS PART UNTIL THE MAIN LOOP
N= 1000
indices_upper_triangle = np.triu_indices(N, k = 1) # k=1 above diagonal
i,j = indices_upper_triangle
ijs = [(i,j) for i,j in zip(i,j)] # create (i,j) positions
# MAIN LOOP HERE
# Main loop
all_neig_results = list()
for v in range(N): # for each point
# from all possible (i,j) pairs, get only (i,j) pairs that have either i=v or j=v
only_current = [item for item in ijs if v in item]
all_neig_results.append(only_current)
The list comprehension in the loop is super slow.
%timeit [item for item in ijs if v in item]
15.9 s ± 361 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If I remove the check argument if v in item
:
%timeit [item for item in ijs]
1.28 s ± 90.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
How can I optimize [item for item in ijs if v in item]
?
Upvotes: 0
Views: 111
Reputation: 363
If your goal is to find nearest neighbors, I'd recommend using a KDTree, which
can be used to rapidly look up the nearest neighbors of any point.
and avoids having to loop over all pairs of points / indexes.
See this example: How to find set of points in x,y grid using KDTree.query_ball_tree
Concerning the leafsize
parameter:
KDTrees "split up" space into different regions. Each leaf corresponds to a region. if leafsize =1 then it continues to make splits until each region contains only one point. This means building the tree will take a long time, but searches will be fast. If the leaves contain more points, the tree is shallower so building it will be faster. But queries will probably be slower. the "brute-force" bit in the doc refers to the fact that if you have more than 1 point in each region, you need to brute force to figure out nearest neighbors among those in the region.
leafsize=10
will mean the leaf has 10 or less. Each split is such that child nodes have (approximately) half of the points in the parent node, so the precise number of points in each leaf will be variable. there are other implementation details that could affect this.
Upvotes: 1
Reputation: 2004
You are doing O(N^2) work here, because you go over the whole list each time in the inner loop.
You should precompute once:
import collections
def precompute(pairs):
d = collections.defaultdict(list)
for ij in pairs:
d[ij[0]].append(ij)
d[ij[1]].append(ij)
return d
d = precompute(ijs)
to make the work in the inner loop fast:
for v in range(N):
only_current = d[v]
this way you are doing only an O(N) work.
Upvotes: 2