Superbest

Reputation: 26612

Searching in a vector is too computationally expensive

I need to perform a block of code like the following:

x = some_number;
y = some_other_number;

u = a_vector_of_numbers;
v = another_vector_of_numbers;
% u and v are of equal size

r1 = ((x == u) | (x == v));   % Expensive!
r2 = ((y == u) | (y == v));   % Expensive!

q = any(r1 & r2);

You can think of this as: x and y are two nodes on a graph, and unless I am mistaken, this checks whether x and y are connected, using [u, v] as an adjacency (edge) list. In other words, I am trying to answer the question: "Is there an index i such that both x and y can be found among u(i) and v(i)?"

I need to do this repeatedly. Both u and v can potentially contain up to thousands of unique values (the number of nodes in the graph is on the order of 10^4), and their length is in the hundreds of thousands (the number of edges is on the order of 10^6).

My profiler tells me the two lines I have marked with comments consume 99% of the run-time, and my program takes quite a while to run, so I am wondering: how much more can this be optimized? What is the fundamental lower bound on computation time, and how close to it am I?

Also, it would be quite easy to outsource this particular code to another language. Could that ever result in a significant performance gain?

Upvotes: 1

Views: 130

Answers (2)

High Performance Mark

Reputation: 78334

I haven't tested this suggestion (too much effort to set up some realistic test data), but ...

Have you tried creating an adjacency matrix for your graph and using that for your enquiries? While creating the matrix (once) would be a relatively expensive operation, the check for the presence of an edge would be much cheaper than scanning both adjacency lists (I think).
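For an undirected graph with 10^4 nodes and 10^6 edges, a sparse matrix keeps the memory cost manageable. A minimal sketch of the idea (untested, and assuming node labels are positive integers usable as indices):

```matlab
% Build once: a sparse logical adjacency matrix from the edge lists u, v.
n = max([u; v]);                    % number of nodes (assumes 1-based labels)
A = logical(sparse(u, v, 1, n, n)); % duplicate edges are summed, then logicalized
A = A | A.';                        % symmetrize for an undirected graph

% Each query is then a single indexed lookup instead of two vector scans:
q = full(A(x, y));
```

The O(length(u)) scan per query becomes an O(1)-ish sparse lookup, at the cost of building A up front.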

If you stick with your current algorithm (or, more to the point, with your current data structure) I'd be surprised if you got much speed-up simply by offloading the work to an implementation in another language. Using another language doesn't change the fact that you are reading through long vectors of data looking for values.

Upvotes: 3

Pursuit

Reputation: 12345

If your first check (r1) is likely to eliminate most of the candidates, your second check can be pre-filtered to test only the possible matches. The code for that would look like this:

mask_r1 = ((x == u) | (x == v));   % Expensive!
r2 = ((y == u(mask_r1)) | (y == v(mask_r1)));   % Less expensive!
q = any(r2);

I have even seen cases (usually in older versions of Matlab) where adding a find to the first line improved performance, but I don't think that is true anymore (they've pulled that optimization into the parser). Timing results for the three methods (original, logical mask, explicit index list) are below:

x = 2;
y = 3;
v = randi(200,1e5,1);
u = randi(200,1e5,1);

tic;
for ix = 1:1000
    r1 = ((x == u) | (x == v));   % Expensive!
    r2 = ((y == u) | (y == v));   % Expensive!
    q = any(r1 & r2);
end
toc;  %1.175234


tic;
for ix = 1:1000
    mask_r1 = ((x == u) | (x == v));   % Expensive!
    r2 = ((y == u(mask_r1)) | (y == v(mask_r1)));   % Less expensive!
    q = any(r2);
end
toc;  %0.878857

tic;
for ix = 1:1000
    ixs_r1 = find(((x == u) | (x == v)));   % Expensive!
    r2 = ((y == u(ixs_r1)) | (y == v(ixs_r1)));   % Less expensive!
    q = any(r2);
end
toc;  %1.118103

Upvotes: 4
