Reputation: 2534
I am parsing a data file in Matlab to plot it. The data I'm parsing has 3 columns:
category | x | y
The file may have several points for any category. In order to plot them together and get the legend right, I need to sort the data such that each category results in a n x 2
matrix, which represents n
points in a given category.
In the following excerpt, don't worry about the parsing, but instead the important part is how the map is appended. category
is the key, and newVal
is a single x-y pair value that needs to be concatenated to that category
's matrix of points.
disp('Sorting categories...')
map = containers.Map();
for i = 2:size(data,2)
str = strsplit(data{i}, '\t');
category = strsplit(str{1}, '.');
category = category{1};
newVal=sscanf([str{2} ',' str{3}],'%f,%f')';
%Interesting stuff starts here
if(isKey(map, category))
mapVal = [map(category); newVal];
else
mapVal = newVal;
end
map = [map; containers.Map(category, mapVal)];
end
This takes way too long with 59k points, and I would really like to change the for
into parfor
. My process requires a read-modify-write on the map
variable, which isn't going to work. I'd like the code to do something like this instead (assuming it would be faster, which may not be true anyway):
disp('Sorting categories...')
map = containers.Map();
for i = 2:size(data,2)
str = strsplit(data{i}, '\t');
category = strsplit(str{1}, '.');
category = category{1};
newVal=sscanf([str{2} ',' str{3}],'%f,%f')';
maps{i} = containers.Map(category, mapVal);
end
map = [maps{:}];
However, concatenating maps in Matlab causes values to be overwritten rather than appended. This will result in each category only holding the very last point that was parsed for that category. Is there a way to avoid this behavior?
Upvotes: 0
Views: 148
Reputation: 53
Unfortunately I cannot comment on the accepted answer, as I would of rather inserted this there. I found this question to be a great learning experience since I've never used the maps.containers
or accumarray
function. I was intrigued by accumarray
and attempted to apply it to one of my own problems but I found it to be much slower. In that case I had a vector with 5e6 elements and I was grouping 8e5 categories and taking the L2 norm of the values. I then decided to apply a similar approach to the problem here. I found not using accumarray
to be a hair faster. Here is what I did,
tic;
xyByCategory = cell(numel(allCats),1);
for kk = 1:numel(allCats)
xyByCategory{kk} = xyData(strcmp(categoryData,allCats{kk}),:);
end
toc
using N = 6e6
accumarray
approach in the answer,
Elapsed time is 0.611510 seconds.
The above approach,
Elapsed time is 0.429321 seconds.
Upvotes: 0
Reputation: 25160
Although I love to promote PARFOR and PARFEVAL, I really think this is a job for ACCUMARRAY. Here's something pretty close to your problem. In the first chunk, I synthesize some data - in your real case, I suggest you aim to read all the data from a file in one (perhaps using TEXTSCAN). Anyway, once you've done that, the idea is to convert the categories into a numeric form, and then call accumarray
to collect together results in the same category. accumarray
is something of a tricky beast to master, so perhaps someone more skilled in the art can figure out a better way than the two calls I'm making - but on my machine, the accumarray
piece runs in 0.02 seconds...
% Generate some data rather than reading from a file
N = 59000;
allCats = {'foo', 'bar', 'baz'};
categoryData = allCats(randi([1,numel(allCats)], N, 1));
xyData = rand(N, 2);
% work out which category each row is in (in your case, you could
% first generate allCats using "unique(categoryData)"
[allCats, ~, whichCategory] = unique(categoryData);
% Use accumarray to build up one cell array for each category, once
% each for the x and y data
tic
xByCategory = accumarray(whichCategory, xyData(:,1), [], @(x){x});
yByCategory = accumarray(whichCategory, xyData(:,2), [], @(x){x});
toc
Upvotes: 1
Reputation: 53
I'm not sure how much time is too long. This script took me to 52s to complete using 6 workers. 22sec for the parfor and 29sec for the merge loop. Its not elegant, but it does apply a parfor loop. You can further speed this up using parfeval
and the fetchnext
functions. fetchnext
will allow you to merge your containers as they become available, thereby performing the merge while workers still build your map containers. I'm guessing this would translate into roughly 30-ish seconds vs the 52s.
Note, parfeval
and fetchnext
are only available in 2013b and above.
% Create some test data
N = 59000;
aa(:,1) = 10*rand(N,1);
% aa(:,1) = [1 2 2 4 5 6 6 8 9 10];
aa(:,2) = rand(N,1);
aa(:,3) = rand(N,1);
data = strtrim(cellstr(num2str(aa,'%g,%g,%g'))');
ndata = size(data,2);
p = gcp;
if isempty(p)
NumWorkers = 1;
else
NumWorkers = p.NumWorkers;
end
tic;
% work out the indices for each worker
numchunks = ceil(ndata/NumWorkers);
for kk = 1:numchunks
strt = (kk-1)*NumWorkers+1;
endl = kk*NumWorkers;
if endl > ndata
endl = ndata;
end
ind{kk} = strt:endl;
end
maps = cell(0);
parfor jj = 1:numel(ind)
% disp('Sorting categories...')
maps{jj,1} = containers.Map();
for i = 1:numel(ind{jj})
str = strsplit(data{ind{jj}(i)}, ',');
category = strsplit(str{1}, '.');
category = category{1};
newVal=sscanf([str{2} ',' str{3}],'%f,%f')';
%Interesting stuff starts here
if(isKey(maps{jj,1}, category))
mapVal = [maps{jj,1}(category); newVal];
else
mapVal = newVal;
end
maps{jj,1} = [maps{jj,1}; containers.Map(category, mapVal)];
end
end
% merge your map containers one-by-one
for kk = 2:numel(maps)
auxkeys = maps{1}.keys;
currkeys = maps{2}.keys;
[ia,ib] = ismember(currkeys,auxkeys);
if any(ia)
ia = find(ia);
for mm = 1:numel(ia)
maps{1}(auxkeys{ib(ia(mm))}) = [maps{1}(auxkeys{ib(ia(mm))}); maps{2}(currkeys{ia(mm)})];
remove(maps{2},currkeys{ia(mm)});
end
end
maps{1} = [maps{1}; maps{2}];
maps(2) = [];
end
toc
Upvotes: 1