Reputation: 3904
Would be grateful for some pointers. I am reading about 1M rows of data and it is taking almost 24 hours with the following code. How can I improve the execution time?
The array day
contains the value of the nth day from the start, and there can be more than one record for a particular day. The program checks whether a particular id (stored in unique_id
) is repeated within 180 days.
%// calculating the number of repeats within 180 days
fid2 = 'data_050913/Unique_id_repeat_count1.xlsx';
fid1 = 'data_050913/data_050913_2000.csv';
fid_data = fopen(fid1);
data = fgetl(fid_data); %// the first line, title line
ep = 0; %// position point number
while 1
    data = fgetl(fid_data);
    if (length(data) < 10)
        break;
    end
    ep = ep + 1;
    id = find(data == ',');
    unique_id(ep) = str2num(data(1:id(1)-1));
    day(ep) = str2num(data(id(8)+1:id(9)-1));
end
repeat = zeros(ep,1);
tic
i = 1;
count = 0;
while i <= ep
    j = i + 1;
    while ( (j <= ep) && (day(j) <= day(i)+179) )
        if unique_id(i) == unique_id(j)
            count = 1;
            break;
        end
        j = j + 1;
    end
    repeat(i,1) = count;
    count = 0;
    i = i + 1;
end
toc
i = 1;
k = 1;
while i <= ep
    count = 0; %// accumulate repeats for this day (starting from repeat(i) would double-count it)
    j = i;
    while (day(j) == day(i))
        count = count + repeat(j,1);
        j = j + 1;
        if j > ep
            break;
        end
    end
    day_final(k,1) = day(i);
    repeat_final(k,1) = count;
    k = k + 1;
    i = j;
end
xlswrite(fid2,day_final,'Repeat_Count','A2');
xlswrite(fid2,repeat_final,'Repeat_Count','B2');
Thanks
Upvotes: 5
Views: 202
Reputation: 1900
First, a small example of how logical indexing works:
vector = [0 4 5 2 4]
logicalIndex = (vector == 4) %// logicalIndex is of class logical
excerpOfVector = vector(logicalIndex)
%// some other ways to use logical indexing
excerpOfVectorSecondVariation = zeros(1,length(vector))
excerpOfVectorSecondVariation(logicalIndex) = vector(logicalIndex)
vector(vector < 5) = 11; %// implicit use of logical indexing
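Applied to the question, logical indexing can replace the inner search loop. A sketch, assuming unique_id and day are row vectors as in the original code (note this still scans the full vectors on every pass, so it illustrates the indexing rather than being the fastest approach):

```matlab
%// For each record i, mark it if the same id appears again within 179 days.
n = length(unique_id);
repeat = zeros(n,1);
for i = 1:n
    later    = (1:n) > i;                %// records after i
    inWindow = day <= day(i) + 179;      %// records within the 180-day window
    sameId   = unique_id == unique_id(i);
    repeat(i) = any(later & inWindow & sameId);
end
```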
Upvotes: 1
Reputation: 7433
The code below runs about 200x faster than your original code and gives the same result.
Of course, the speed-up is dependent on the distribution of the input data, and my assumptions may be incorrect (I assume 1000 unique IDs and on average 19 records per day).
I've also written some code to generate data similar to what I believe your input data is.
% Generate input data
ep = 100000;
isRepeatedDay = rand(1,ep) < 0.95;
day = cumsum(~isRepeatedDay);
unique_ids = 1:1000;
unique_id_indices = round(rand(ep,1)*length(unique_ids));
unique_id_indices(unique_id_indices < 1) = 1;
unique_id_indices(unique_id_indices > length(unique_ids)) = length(unique_ids);
unique_id = unique_ids(unique_id_indices);

% Process the input data to find repeats
tic
repeat = zeros(ep,1);
[unique_values,~,indices] = unique(unique_id);
for uv_index = 1:length(unique_values)
    uv_indices = find(indices == uv_index);
    for i = 1:length(uv_indices)-1
        daysDifference = day(uv_indices(i+1)) - day(uv_indices(i));
        if daysDifference <= 179
            repeat(uv_indices(i),1) = 1;
        end
    end
end
toc
toc
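The inner loop over each id's occurrences can itself be vectorized with diff, since only the gap to the next occurrence of the same id matters. A sketch under the same assumptions:

```matlab
repeat = zeros(length(unique_id),1);
[~,~,indices] = unique(unique_id);
for uv_index = 1:max(indices)
    uv_indices = find(indices == uv_index);       %// all records with this id
    gaps = diff(day(uv_indices));                 %// day gaps between consecutive occurrences
    repeat(uv_indices(1:end-1)) = (gaps <= 179);  %// mark records whose next occurrence is in range
end
```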
Upvotes: 2
Reputation: 21561
Here is the way I would do it if unique_id can have many different values (and maybe even if it doesn't).
The operation takes under 5 seconds on my system:
x = round(rand(1000000,1)*10);
result = zeros(size(x));
windowsize = 180;
for t = 1:(numel(x)-windowsize)
    result(t) = sum(x(t+1:t+windowsize) == x(t));
end
I think this is what you need; make sure to check whether you want to look 'forward' or 'backward'.
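Note that the window above is measured in records, not days. If the 180-day window from the question is needed, the record window can be derived per row from the day vector. A sketch, assuming a day vector sorted in ascending order accompanies x:

```matlab
result = zeros(size(x));
for t = 1:numel(x)
    %// last record still within 179 days of record t (day is sorted ascending)
    last = t + find(day(t+1:end) <= day(t) + 179, 1, 'last');
    if ~isempty(last)
        result(t) = sum(x(t+1:last) == x(t));
    end
end
```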
Upvotes: 1
Reputation: 21561
If you're not already doing so, ensure you allocate all memory up front where possible. I've seen MATLAB scripts go from 24 hours to 8 minutes by doing this.
Use the zeros
function to preallocate memory for all growing arrays (day
, unique_id
, repeat
, day_final
and repeat_final
).
x = zeros(1000,1); %// Creates a 1000-by-1 array of zeros (note zeros(1000) would create a 1000-by-1000 matrix)
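When the final length is unknown, as it is here while reading the file line by line, one common pattern is to preallocate generously and trim afterwards. A sketch, where the capacity is a guessed upper bound:

```matlab
capacity = 1500000;              %// guessed upper bound on the row count
unique_id = zeros(capacity,1);
day = zeros(capacity,1);
ep = 0;
%// ... fill unique_id(ep) and day(ep) inside the read loop, incrementing ep ...
unique_id = unique_id(1:ep);     %// trim the unused tail once the true count is known
day = day(1:ep);
```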
Upvotes: 3