Reputation: 3904
Would be grateful for some pointers. I am reading about 1M rows of data and it is taking almost 24 hours with the following code. How can I improve the execution time?
The array day
contains the value of the nth day from the start, and there can be more than one record for a particular day. The program checks whether a particular id (stored in unique_id
) is repeated within 180 days.
%// calculating the number of repeats within 180 days
fid2 = 'data_050913/Unique_id_repeat_count1.xlsx';
fid1 = 'data_050913/data_050913_2000.csv';
fid_data = fopen(fid1);
data = fgetl(fid_data); %// the first line, title line
ep = 0; %// position point number
while 1
    data = fgetl(fid_data);
    if (length(data) < 10)
        break;
    end
    ep = ep + 1;
    id = find(data == ',');
    unique_id(ep) = str2num(data(1:id(1)-1));
    day(ep) = str2num(data(id(8)+1:id(9)-1));
end
repeat = zeros(ep,1);
tic
i = 1;
count = 0;
while i <= ep
    j = i + 1;
    while ( (j <= ep) && (day(j) <= day(i)+179) )
        if unique_id(i) == unique_id(j)
            count = 1;
            break;
        end
        j = j + 1;
    end
    repeat(i,1) = count;
    count = 0;
    i = i + 1;
end
toc
i = 1;
k = 1;
while i <= ep
    count = 0; %// accumulate repeats for this day (starting from repeat(i) would double-count it)
    j = i;
    while (day(j) == day(i))
        count = count + repeat(j,1);
        j = j + 1;
        if j > ep
            break;
        end
    end
    day_final(k,1) = day(i);
    repeat_final(k,1) = count;
    k = k + 1;
    i = j;
end
xlswrite(fid2,day_final,'Repeat_Count','A2');
xlswrite(fid2,repeat_final,'Repeat_Count','B2');
Thanks
Upvotes: 5
Views: 202
Reputation: 1900
First, a small example of how logical indexing works:
vector = [0 4 5 2 4]
logicalIndex = (vector == 4) %// logicalIndex is of class logical
excerpOfVector = vector(logicalIndex)
%// some other ways to use logical indexing
excerpOfVectorSecondVariation = zeros(1,length(vector))
excerpOfVectorSecondVariation(logicalIndex) = vector(logicalIndex)
vector(vector < 5) = 11; %// implicit use of logical indexing
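Applied to the question, logical indexing can replace the inner search loop. A sketch, assuming unique_id and day are row vectors as in the original code (note this still scans the full vectors on every pass, so it illustrates the indexing rather than being the fastest approach):

```matlab
%// For each record i, mark it if the same id appears again within 179 days.
n = length(unique_id);
repeat = zeros(n,1);
for i = 1:n
    later    = (1:n) > i;                %// records after i
    inWindow = day <= day(i) + 179;      %// records within the 180-day window
    sameId   = unique_id == unique_id(i);
    repeat(i) = any(later & inWindow & sameId);
end
```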
Upvotes: 1
Reputation: 7433
The code below runs about 200x faster than your original code and gives the same result.
Of course, the speed-up is dependent on the distribution of the input data, and my assumptions may be incorrect (I assume 1000 unique IDs and on average 19 records per day).
I've also written some code to generate data similar to what I believe your input data is.
% Generate input data
ep = 100000;
isRepeatedDay = rand(1,ep) < 0.95;
day = cumsum(~isRepeatedDay);
unique_ids = 1:1000;
unique_id_indices = round(rand(ep,1)*length(unique_ids));
unique_id_indices(unique_id_indices < 1) = 1;
unique_id_indices(unique_id_indices > length(unique_ids)) = length(unique_ids);
unique_id = unique_ids(unique_id_indices);

% Process the input data to find repeats
tic
repeat = zeros(ep,1);
[unique_values,~,indices] = unique(unique_id);
for uv_index = 1:length(unique_values)
    uv_indices = find(indices == uv_index);
    for i = 1:length(uv_indices)-1
        daysDifference = day(uv_indices(i+1)) - day(uv_indices(i));
        if daysDifference <= 179
            repeat(uv_indices(i),1) = 1;
        end
    end
end
toc
toc
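The inner loop over each id's occurrences can itself be vectorized with diff, since only the gap to the next occurrence of the same id matters. A sketch under the same assumptions:

```matlab
repeat = zeros(length(unique_id),1);
[~,~,indices] = unique(unique_id);
for uv_index = 1:max(indices)
    uv_indices = find(indices == uv_index);       %// all records with this id
    gaps = diff(day(uv_indices));                 %// day gaps between consecutive occurrences
    repeat(uv_indices(1:end-1)) = (gaps <= 179);  %// mark records whose next occurrence is in range
end
```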
Upvotes: 2
Reputation: 21561
Here is the way I would do it if unique_id can have many different values (and maybe even if it doesn't).
The operation takes under 5 seconds on my system:
x = round(rand(1000000,1)*10);
result = zeros(size(x));
windowsize = 180;
for t = 1:(numel(x)-windowsize)
    result(t) = sum(x(t+1:t+windowsize) == x(t));
end
I think this is what you need; make sure to check whether you want to look 'forward' or 'backward'.
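Note that the window above is measured in records, not days. If the 180-day window from the question is needed, the record window can be derived per row from the day vector. A sketch, assuming a day vector sorted in ascending order accompanies x:

```matlab
result = zeros(size(x));
for t = 1:numel(x)
    %// last record still within 179 days of record t (day is sorted ascending)
    last = t + find(day(t+1:end) <= day(t) + 179, 1, 'last');
    if ~isempty(last)
        result(t) = sum(x(t+1:last) == x(t));
    end
end
```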
Upvotes: 1
Reputation: 21561
If you're not already doing so, ensure you allocate all memory up front where possible. I've seen MATLAB scripts go from 24 hours to 8 minutes by doing this.
Use the zeros
function to preallocate memory for all growing arrays (day
, unique_id
, repeat
, day_final
and repeat_final
).
x = zeros(1000,1); %// Creates a 1000-by-1 array of zeros (note zeros(1000) would create a 1000-by-1000 matrix)
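When the final length is unknown, as it is here while reading the file line by line, one common pattern is to preallocate generously and trim afterwards. A sketch, where the capacity is a guessed upper bound:

```matlab
capacity = 1500000;              %// guessed upper bound on the row count
unique_id = zeros(capacity,1);
day = zeros(capacity,1);
ep = 0;
%// ... fill unique_id(ep) and day(ep) inside the read loop, incrementing ep ...
unique_id = unique_id(1:ep);     %// trim the unused tail once the true count is known
day = day(1:ep);
```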
Upvotes: 3