Memory and execution speed in Matlab

Question

I am trying to create random lines and select some of them, which are really rare. My code is rather simple, but to get something that I can use I need to create very large vectors (i.e.: <100000000 x 1, the tracks variable in my code). Is there any way to create larger vectors and to reduce the time needed for all those calculations?

My code is

%Initial line values

tracks=input('Give me the number of muon tracks: ');
width=1e-4;
height=2e-4;

Ystart=15.*ones(tracks,1);
Xstart=-40+80.*rand(tracks,1);
%Xend=-40+80.*rand(tracks,1);
Xend=laprnd(tracks,1,Xstart,15);
X=[Xstart';Xend'];
Y=[Ystart';zeros(1,tracks)];
b=(Ystart.*Xend)./(Xend-Xstart);
hot=0;
cold=0;

for i=1:tracks
    if ((Xend(i,1)<width/2 && Xend(i,1)>-width/2)||(b(i,1)<height && b(i,1)>0))
        plot(X(:, i),Y(:, i),'r');%the chosen ones!
        hold all
        hot=hot+1;
    else
        %plot(X(:, i),Y(:, i),'b');%the rest of them
        %hold all
        cold=cold+1;
    end
end

I am also calling a Laplace distribution generator written by Elvis Chen:

function y  = laprnd(m, n, mu, sigma)
%LAPRND generate i.i.d. laplacian random number drawn from laplacian distribution
%   with mean mu and standard deviation sigma. 
%   mu      : mean
%   sigma   : standard deviation
%   [m, n]  : the dimension of y.
%   Default mu = 0, sigma = 1. 
%   For more information, refer to
%   http://en.wikipedia.org./wiki/Laplace_distribution

%   Author  : Elvis Chen (bee33@sjtu.edu.cn)
%   Date    : 01/19/07

%Check inputs
if nargin < 2
    error('At least two inputs are required');
end

if nargin == 2
    mu = 0; sigma = 1;
end

if nargin == 3
    sigma = 1;
end

% Generate Laplacian noise
u = rand(m, n)-0.5;
b = sigma / sqrt(2);
y = mu - b * sign(u).* log(1- 2* abs(u));

The result plot is

Rody Oldenhuis · Accepted Answer

As you indicate, your problem is two-fold. On the one hand, you have memory issues because you need to do so many trials. On the other hand, you have performance issues, because you have to process all those trials.

Solutions to each issue often have a negative impact on the other issue. IMHO, the best approach would be to find a compromise.

More trials are only possible if you get rid of those gargantuan arrays that are required for vectorization, and use a different strategy to do the loop. I will give priority to the possibility of using more trials, possibly at the cost of optimal performance.

When I execute your code as-is in the Matlab profiler, it immediately shows that the initial memory allocation for all your variables takes a lot of time. It also shows that the plot and hold all commands are the most time-consuming lines of them all. Some more trial-and-error shows that there is a disappointingly low maximum value for the trials you can do before OUT OF MEMORY errors start appearing.

The loop can be accelerated tremendously if you know a few things about its limitations in Matlab. In older versions of Matlab, it used to be true that loops should be avoided completely in favor of 'vectorized' code. In recent versions (I believe R2008a and up), the Mathworks introduced a piece of technology called the JIT accelerator (Just-in-Time compiler) which translates M-code into machine language on the fly during execution. Simply put, the JIT accelerator allows your code to bypass Matlab's interpreter and talk much more directly with the underlying hardware, which can save a lot of time.

The often-heard advice that loops should be avoided in Matlab is no longer generally true. While vectorization still has its value, any procedure of sizable complexity that is implemented using only vectorized code is often illegible, hard to understand, hard to change and hard to maintain. An implementation of the same procedure that uses loops often has none of these drawbacks and, moreover, will quite often be faster and require less memory.

Unfortunately, the JIT accelerator has a few nasty (and IMHO, unnecessary) limitations that you'll have to learn about.

One such thing is plot; it's generally a better idea to let a loop do nothing other than collect and manipulate data, and delay any plotting commands etc. until after the loop.

Another such thing is hold; the hold function is not a Matlab built-in function, meaning, it is implemented in M-language. Matlab's JIT accelerator is not able to accelerate non-builtin functions when used in a loop, meaning, your entire loop will run at Matlab's interpretation speed, rather than machine-language speed! Therefore, also delay this command until after the loop :)

Now, in case you're wondering, this last step can make a HUGE difference: I know of one case where copy-pasting a function body into the upper-level loop caused a 1200x performance improvement. Days of execution time had been reduced to minutes!

There is actually another minor issue in your loop (a really small one, and fixing it is rather inconvenient, I will immediately agree): the name of the loop variable should not be i. The name i is the name of the imaginary unit in Matlab, and the name resolution will unnecessarily consume time on each iteration. It's small, but non-negligible.

Now, considering all this, I've come to the following implementation:

function [hot, cold, h] = MuonTracks(tracks)

    % NOTE: no variables larger than 1x1 are initialized

    width  = 1e-4;
    height = 2e-4;

    % constant used for Laplacian noise distribution  
    bL = 15 / sqrt(2);

    % Loop through all tracks
    X = [];
    hot = 0;  
    ii = 0;          
    while ii < tracks

        ii = ii + 1;

        % Note that I've inlined (== copy-pasted) the original laprnd()
        % function call. This was necessary to work around limitations 
        % in loops in Matlab, and prevent the necessity of those HUGE 
        % variables. 
        %
        % Of course, you can still easily generalize all of this: 

        % the new data
        u = rand-0.5;

        Ystart = 15; 
        Xstart = 800*rand-400;
        Xend   = Xstart - bL*sign(u)*log(1-2*abs(u));

        b = (Ystart*Xend)/(Xend-Xstart);


        % the test
        if ((b < height && b > 0)) ||...
            (Xend < width/2 && Xend > -width/2)

            hot = hot+1;

            % growing an array is perfectly fine when the chances of it
            % happening are so slim
            X = [X [Xstart; Xend]]; %#ok

        end
    end

    % This is trivial to do here, and prevents an 'else' in the loop
    cold = tracks - hot;

    % Now plot the chosen ones
    h = figure;
    hold all
    Y = repmat([15;0], 1, size(X,2));
    plot(X, Y, 'r'); 

end

With this implementation, I can do this:

>> tic, MuonTracks(1e8); toc
Elapsed time is 24.738725 seconds.

with a completely negligible memory footprint.

The profiler now also shows a nice and even distribution of effort along the code; no lines that really stand out because of their memory use or performance.

It's possibly not the fastest possible implementation (if anyone sees obvious improvements, please, feel free to edit them in). But, if you're willing to wait, memory no longer limits the number of tracks: the only cap left is the double-precision loop counter, which stops incrementing at 2^53 (about 9e15) :)

I've also done an implementation in C, which can be compiled into a Matlab MEX file:

/* DoMuonCounting.c */

#include <stdlib.h>
#include <math.h>
#include <time.h>
#include "mex.h"
#include "matrix.h"


void CountMuons(
    unsigned long long tracks,
    unsigned long long *hot, unsigned long long *cold, double *Xout);


/* simple little helper functions */
double sign(double x) { return (x>0)-(x<0); }
double rand_double()  { return (double)rand()/(double)RAND_MAX; }


/* the gateway function */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    const mwSize
        dims[] = {1,1};

    const mxArray
        /* Output arguments */
        *hot_out  = plhs[0] = mxCreateNumericArray(2,dims, mxUINT64_CLASS,0),
        *cold_out = plhs[1] = mxCreateNumericArray(2,dims, mxUINT64_CLASS,0),
        *X_out    = plhs[2] = mxCreateDoubleMatrix(2,10000, mxREAL);

    const unsigned long long
        tracks = (const unsigned long long)mxGetPr(prhs[0])[0];

    unsigned long long
        *hot  = (unsigned long long*)mxGetData(hot_out),
        *cold = (unsigned long long*)mxGetData(cold_out);
    double
        *Xout = mxGetPr(X_out);

    /* call the actual function, and return */
    CountMuons(tracks, hot,cold, Xout);
}


// The actual muon counting
void CountMuons(
    unsigned long long tracks,
    unsigned long long *hot, unsigned long long *cold, double *Xout)
{
    const double
        width  = 1.0e-4,
        height = 2.0e-4,
        bL     = 15.0/sqrt(2.0),
        Ystart = 15.0;

    double
        Xstart,
        Xend,
        u,
        b;
    unsigned long long
        i = 0ul;


    *hot  = 0ul;
    *cold = tracks;

    /* seed the RNG */
    srand((unsigned)time(NULL));

    /* aaaand start! */
    while (i++ < tracks)
    {
        u = rand_double() - 0.5;

        Xstart = 800.0*rand_double() - 400.0;
        Xend   = Xstart - bL*sign(u)*log(1.0-2.0*fabs(u));

        b = (Ystart*Xend)/(Xend-Xstart);

        if ((b < height && b > 0.0) || (Xend < width/2.0 && Xend > -width/2.0))
        {
            Xout[0 + *hot*2] = Xstart;
            Xout[1 + *hot*2] = Xend;
            ++(*hot);
            --(*cold);
        }
    }
}

Compile DoMuonCounting.c in Matlab with

mex DoMuonCounting.c

(after having run mex -setup :) and then use it in conjunction with a small M-wrapper like this:

function [hot,cold, h] = MuonTrack2(tracks)

    % call the MEX function 
    [hot,cold, Xtmp] = DoMuonCounting(tracks);

    % process outputs, and generate plots

    hot = uint32(hot); % circumvents limitations in 32-bit matlab

    X = Xtmp(:,1:hot);
    clear Xtmp

    h = NaN;
    if ~isempty(X)
        h = figure;
        hold all         
        Y = repmat([15;0], 1, hot);
        plot(X, Y, 'r');       
    end    
end

which allows me to do

>> tic, MuonTrack2(1e8); toc
Elapsed time is 14.496355 seconds.

Note that the memory footprint of the MEX version is slightly larger, but I think that's nothing to worry about.

The only flaw I see is the fixed maximum number of Muon counts (hard-coded as 10000 as the initial array size of Xout; needed because there are no dynamically growing arrays in standard C)...if you're worried this limit could be broken, simply increase it, change it to be equal to a fraction of tracks, or do some smarter (but more painful) dynamic array-growing tricks.
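As a rough sketch of the "dynamic array-growing tricks" mentioned above, the fixed 10000-column buffer could be replaced by one that doubles its capacity whenever it fills up. The TrackBuffer type and the track_buffer_* function names below are made up for illustration; they are not part of the MEX file above. The sketch uses plain realloc/free; inside a MEX file one would presumably use mxRealloc/mxFree instead and hand the final pointer to the output array.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable buffer of (Xstart, Xend) pairs. */
typedef struct {
    double *data;      /* 2*capacity doubles, stored pair by pair */
    size_t  count;     /* number of pairs stored so far */
    size_t  capacity;  /* number of pairs that fit before the next growth */
} TrackBuffer;

/* Start empty; the first push performs the first allocation. */
void track_buffer_init(TrackBuffer *buf)
{
    assert(buf != NULL);
    buf->data = NULL;
    buf->count = 0;
    buf->capacity = 0;
}

/* Append one pair, doubling the capacity when the buffer is full.
   Returns 0 on success, -1 if the allocation failed (the buffer and
   its existing contents are left untouched in that case). */
int track_buffer_push(TrackBuffer *buf, double xstart, double xend)
{
    assert(buf != NULL);
    if (buf->count == buf->capacity) {
        size_t new_capacity = buf->capacity ? 2 * buf->capacity : 16;
        double *grown = realloc(buf->data,
                                2 * new_capacity * sizeof(double));
        if (grown == NULL)
            return -1;
        buf->data = grown;
        buf->capacity = new_capacity;
    }
    buf->data[2 * buf->count + 0] = xstart;
    buf->data[2 * buf->count + 1] = xend;
    buf->count++;
    return 0;
}

/* Release the storage and reset the buffer to the empty state. */
void track_buffer_free(TrackBuffer *buf)
{
    assert(buf != NULL);
    free(buf->data);
    track_buffer_init(buf);
}
```

In CountMuons, the two writes to Xout would then become a single track_buffer_push(&buf, Xstart, Xend) call. Because hits are so rare, the occasional reallocation should cost next to nothing compared to the loop itself.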
