tstenner

Reputation: 10311

Split string into cell array by positions

I have a file with strings of a known length, but no separator.

% What should be the result
vals = arrayfun(@(x) ['Foobar ', num2str(x)], 1:100000, 'UniformOutput', false);

% what the file looks like when read in
strs = cell2mat(vals);
strlens = cellfun(@length, vals);

The most straightforward approach is quite slow:

out = cell(1, length(strlens));
for i=1:length(strlens)
    out{i} = fread(f, strlens(i), '*char');
end % 5.7s

Reading everything in and splitting it up afterwards is a lot faster:

strs = fread(f, sum(strlens), '*char');
out = cell(1, length(strlens));
slices = [0, cumsum(strlens)];
for i=1:length(strlens)
    out{i} = strs(slices(i)+1:slices(i+1));
end % 1.6s

With a mex function I can get down to 0.6s, so there's still a lot of room for improvement. Can I get comparable performance with pure Matlab (R2016a)?

Edit: the seemingly perfect mat2cell function doesn't help:

out = mat2cell(strs, 1, strlens); % 2.49s

Upvotes: 1

Views: 90

Answers (1)

Andrew Janke

Reputation: 23908

Your last approach – reading everything at once and splitting it up afterwards – looks pretty optimal to me, and is how I do stuff like this.

For me, it's running in about 80 ms when the file is on a local SSD, in both R2016b and R2019a on a Mac.

function out = scratch_split_strings(strlens)
%
% Example:
% in_strs = arrayfun(@(x) ['Foobar ', num2str(x)], 1:100000, 'UniformOutput', false);
% strlens = cellfun(@length, in_strs);
% big_str = cat(2, in_strs{:});
% fid = fopen('text.txt', 'w'); fprintf(fid, '%s', big_str); fclose(fid);
% scratch_split_strings(strlens);

t0 = tic;
fid = fopen('text.txt');
txt = fread(fid, sum(strlens), '*char');
fclose(fid);
fprintf('Read time: %0.3f s\n', toc(t0));

str = txt;
t0 = tic;
out = cell(1, length(strlens));
slices = [0, cumsum(strlens)];
for i = 1:length(strlens)
    out{i} = str(slices(i)+1:slices(i+1))';
end
fprintf('Munge time: %0.3f s\n', toc(t0));

end

>> scratch_split_strings(strlens);
Read time: 0.002 s
Munge time: 0.075 s

Have you stuck it in the profiler to see what's taking up your time here?
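In case it helps, here is a minimal sketch of profiling the split loop on its own. It assumes the data is already in memory (built the same way as in the question, just smaller) so that file I/O stays out of the measurement; the sizes and variable names are only illustrative.

```matlab
% Illustrative data standing in for the file contents (assumption: 1000 strings)
vals = arrayfun(@(x) ['Foobar ', num2str(x)], 1:1000, 'UniformOutput', false);
strlens = cellfun(@length, vals);
strs = [vals{:}];                 % one long row vector of chars

profile on                        % start collecting timing data
out = cell(1, numel(strlens));
slices = [0, cumsum(strlens)];    % slices(i)+1 is the start of string i
for i = 1:numel(strlens)
    out{i} = strs(slices(i)+1:slices(i+1));
end
profile off                       % stop collecting

p = profile('info');              % results as a struct (see p.FunctionTable)
% profile viewer                  % uncomment to open the interactive report
```

If the report shows most of the time outside the indexing line, the difference between our timings is probably coming from somewhere other than the loop itself.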

As far as I know, there is no faster way to split up a single primitive array into variable-length subarrays with native M-code. You're doing it right.

Upvotes: 2
