Reputation: 10311
I have a file with strings of a known length, but no separator.
% What should be the result
vals = arrayfun(@(x) ['Foobar ', num2str(x)], 1:100000, 'UniformOutput', false);
% what the file looks like when read in
strs = cell2mat(vals);
strlens = cellfun(@length, vals);
The most straightforward approach is quite slow:
out = cell(1, length(strlens));
for i=1:length(strlens)
out{i} = fread(f, strlens(i), '*char');
end % 5.7s
Reading everything in and splitting it up afterwards is a lot faster:
strs = fread(f, sum(strlens), '*char');
out = cell(1, length(strlens));
slices = [0, cumsum(strlens)];
for i=1:length(strlens)
out{i} = strs(slices(i)+1:slices(i+1));
end % 1.6s
With a mex function I can get down to 0.6s, so there's still a lot of room for improvement. Can I get comparable performance with pure Matlab (R2016a)?
Edit: the seemingly perfect mat2cell function doesn't help:
out = mat2cell(strs, 1, strlens); % 2.49s
Upvotes: 1
Views: 90
Reputation: 23908
Your last approach – reading everything at once and splitting it up afterwards – looks pretty optimal to me, and is how I do stuff like this.
For me, it runs in about 80 ms with the file on a local SSD, in both R2016b and R2019a on a Mac.
function out = scratch_split_strings(strlens)
% SCRATCH_SPLIT_STRINGS Read text.txt and split it into strings of the given lengths.
%
% Example:
% in_strs = arrayfun(@(x) ['Foobar ', num2str(x)], 1:100000, 'UniformOutput', false);
% strlens = cellfun(@length, in_strs);
% big_str = cat(2, in_strs{:});
% fid = fopen('text.txt', 'w'); fprintf(fid, '%s', big_str); fclose(fid);
% scratch_split_strings(strlens);
t0 = tic;
fid = fopen('text.txt');
txt = fread(fid, sum(strlens), '*char');
fclose(fid);
fprintf('Read time: %0.3f s\n', toc(t0));
str = txt;
t0 = tic;
out = cell(1, length(strlens));
slices = [0, cumsum(strlens)];
for i = 1:length(strlens)
out{i} = str(slices(i)+1:slices(i+1))';
end
fprintf('Munge time: %0.3f s\n', toc(t0));
end
>> scratch_split_strings(strlens);
Read time: 0.002 s
Munge time: 0.075 s
Have you stuck it in the profiler to see what's taking up your time here?
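In case it helps, here is a minimal sketch of a profiler run. It assumes scratch_split_strings is saved on the MATLAB path, that text.txt has been written as described in the function's example comment, and that strlens is in the workspace.

```matlab
% Profile one call of the split function and open the report.
% Assumes text.txt exists and strlens matches its contents.
profile on
out = scratch_split_strings(strlens);
profile off
profile viewer   % opens the per-function and per-line timing report
```

The report's per-line view should show whether the time goes into fread, the slicing loop, or something else entirely.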
As far as I know, there is no faster way to split up a single primitive array into variable-length subarrays with native M-code. You're doing it right.
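To check this on your own machine, a rough comparison sketch follows. It rebuilds the test data from the question in memory (no file I/O), times the slicing loop against mat2cell with tic/toc, and verifies that both give the same result. The variable names out_loop, out_m2c, t_loop and t_m2c are just illustrative, and absolute timings will vary with MATLAB version and hardware.

```matlab
% Build the test data from the question as one long char row vector.
vals = arrayfun(@(x) ['Foobar ', num2str(x)], 1:100000, 'UniformOutput', false);
strlens = cellfun(@length, vals);
strs = [vals{:}];

% Approach 1: preallocate and slice in a loop.
t0 = tic;
out_loop = cell(1, numel(strlens));
slices = [0, cumsum(strlens)];
for i = 1:numel(strlens)
    out_loop{i} = strs(slices(i)+1:slices(i+1));
end
t_loop = toc(t0);

% Approach 2: mat2cell on the row vector.
t0 = tic;
out_m2c = mat2cell(strs, 1, strlens);
t_m2c = toc(t0);

% Both approaches should reproduce the original cell array.
assert(isequal(out_loop, out_m2c, vals));
fprintf('loop: %0.3f s, mat2cell: %0.3f s\n', t_loop, t_m2c);
```

Note that strs is a row vector here. The output of fread is a column vector, so it needs a transpose (or mat2cell(strs, strlens, 1)) before the same call works on data read from a file.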
Upvotes: 2