Nakano

Reputation: 45

Replace multiple substrings using strrep in Matlab

I have a big string (around 25M characters) where I need to replace multiple substrings of a specific pattern in it.

Frame 1
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0

...........

Frame 2
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0

...........

Frame 7670
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0

...........

The substring I need to remove is the 'Frame #' line, which occurs around 7670 times. I can pass multiple search strings to strrep using a cell array:

strrep(text,{'Frame 1','Frame 2',..,'Frame 7670'},';')

However, that returns a cell array in which each cell holds the original string with only one of the search strings replaced.
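A small illustration of that behavior, using a made-up short string in place of the real data:

```matlab
% Made-up miniature input; the real string is around 25M characters
str = 'a Frame 1 b Frame 2';

% One output cell per search string, each with only that string replaced
c = strrep(str, {'Frame 1', 'Frame 2'}, ';');

% c{1} is 'a ; b Frame 2'
% c{2} is 'a Frame 1 b ;'
```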

Is there a way to replace multiple substrings in a string other than with regexprep? I noticed that regexprep is considerably slower than strrep, which is why I am trying to avoid it.

With regexprep it would be:

regexprep(text,'Frame \d*',';')

and for a string of 25MB it takes around 47 seconds to replace all the instances.

EDIT 1: added the equivalent regexprep command

EDIT 2: added the size of the string for reference, the number of occurrences of the substring, and the execution time of regexprep

Upvotes: 2

Views: 9534

Answers (3)

horchler

Reputation: 18484

I think that this can be done using only textscan, which is known to be very fast. By specifying a 'CommentStyle', the 'Frame #' lines are stripped out. This may only work because the 'Frame #' entries are on their own lines. This code returns the raw data as one big vector:

s = textscan(text,'%f','CommentStyle','Frame','Delimiter',',');
s = s{:};

You may want to know how many elements are in each frame or even reshape the data into a matrix. You can use textscan again (or before the above) to get just the data for the first frame:

f1 = textscan(text,'%f','CommentStyle','Frame 1','Delimiter',',');
f1 = f1{:};
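If every frame holds the same number of values (an assumption; the question does not state it outright), the full vector can be reshaped so that each row is one frame. The sketch below uses a made-up two-frame string and, as an alternative to a second textscan call, counts the 'Frame ' markers with strfind to get the frame size. The variable names nFrames, nPerFrame and frames are only illustrative.

```matlab
% Made-up miniature input with two frames of six values each
text = sprintf('Frame 1\n0,1,2\n3,4,5\nFrame 2\n6,7,8\n9,10,11\n');

% All values, with the 'Frame #' lines skipped as comments
s = textscan(text, '%f', 'CommentStyle', 'Frame', 'Delimiter', ',');
s = s{:};

% Count the frame markers and derive the number of values per frame
nFrames = numel(strfind(text, 'Frame '));
nPerFrame = numel(s) / nFrames;   % assumes equally sized frames

% One row per frame
frames = reshape(s, nPerFrame, []).';
```

reshape fills column by column, which is why the vector is first shaped as nPerFrame-by-nFrames and then transposed.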

In fact, if you just want the elements from the first line, you can use this:

l1 = textscan(text,'%f,','CommentStyle','Frame 1')
l1 = l1{:}

However, the other nice thing about textscan is that you can use it to read in the file directly (it looks like you may be using some other means currently) using just fopen to get an FID. Thus the string data text doesn't have to be in memory.
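A minimal sketch of reading straight from a file. The file is a temporary one written here only so that the example is self-contained; in practice fname would point at the real data file.

```matlab
% Write a small made-up file (stand-in for the real data file)
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'Frame 1\n0,1,2\n3,4,5\nFrame 2\n6,7,8\n9,10,11\n');
fclose(fid);

% Read the numbers directly from the file, skipping the 'Frame #' lines
fid = fopen(fname, 'r');
s = textscan(fid, '%f', 'CommentStyle', 'Frame', 'Delimiter', ',');
fclose(fid);
s = s{:};

delete(fname);   % remove the temporary file
```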

Upvotes: 1

Nakano

Reputation: 45

Ok, in the end I found a way around the problem. Instead of using regexprep to replace the whole 'Frame #' substring, I remove only the 'Frame ' part (including the whitespace, but not the number):

rawData = strrep(text,'Frame ','');

This results in something like this:

1
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0

...........

2
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0

...........

7670
0,0,0,0,0,1,2,34,0
0,1,2,3,34,12,3,4,0

...........

Then I change all the commas (,) and line breaks (\r\n) into semicolons (;), again using strrep, and I create one big vector with all the numbers:

rawData = strrep(rawData,sprintf('\r\n'),';');
rawData = strrep(rawData,';;',';');
rawData = strrep(rawData,';;',';');
rawData = strrep(rawData,',',';');
rawData = textscan(rawData,'%f','Delimiter',';');

Then I remove the unnecessary numbers (1,2,...,7670). Since each frame contains a fixed amount of numbers, they sit at regularly spaced positions in the array:

rawData{1}(firstInstance:spacing:lastInstance)=[];
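To make the indexing concrete, here is a sketch with a made-up parsed vector. It assumes that the frame number comes first in each frame and that every frame has nPerFrame data values; nPerFrame and v are illustrative names, not taken from the original code.

```matlab
nPerFrame = 4;   % assumed number of data values per frame

% Made-up parsed vector: each frame number followed by that frame's values
v = [1; 10; 11; 12; 13; 2; 20; 21; 22; 23; 3; 30; 31; 32; 33];

firstInstance = 1;           % the first frame number leads the vector
spacing = nPerFrame + 1;     % one frame number plus its data values
lastInstance = numel(v);     % the colon range never exceeds this index

% Delete the frame numbers at positions 1, 6 and 11
v(firstInstance:spacing:lastInstance) = [];
```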

And then I go on with my manipulations. It seems that the additional strrep calls and the removal of the values from the array are much faster than the equivalent regexprep. With a string of 25M chars, the whole operation takes about 47 seconds with regexprep, while with this workaround it takes only 5 seconds!
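A rough way to compare the two calls yourself, sketched with tic and toc on a made-up string. The frame count and the frame contents are arbitrary, and the timings will depend on the machine and the MATLAB version, so no numbers are claimed here.

```matlab
% Build a made-up test string with nFrames frames
nFrames = 200;
parts = cell(1, nFrames);
for k = 1:nFrames
    parts{k} = sprintf('Frame %d\r\n0,0,1,2\r\n3,4,5,6\r\n', k);
end
text = [parts{:}];

% Time the regular-expression replacement of the whole 'Frame #' marker
tic; a = regexprep(text, 'Frame \d*', ';'); tRegexp = toc;

% Time the plain removal of the 'Frame ' prefix only
tic; b = strrep(text, 'Frame ', ''); tStrrep = toc;

fprintf('regexprep: %.4f s, strrep: %.4f s\n', tRegexp, tStrrep);
```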

Hope this helps somehow.

Upvotes: 2

Luis Mendo

Reputation: 112659

Using regular expressions:

result = regexprep(text,'Frame [0-9]+','');

It's possible to avoid regular expressions as follows. I use strrep with replacement strings that act as masks: each search string is replaced by an equally long run of char(0). The resulting strings therefore all have the same length as the original and stay aligned, so they can be combined into the final result using the masks. I've also included the ; you want. I don't know if it will be faster than regexprep or not, but it's definitely more fun :-)

% Data
text = 'Hello Frame 1 test string Frame 22 end of Frame 2 this'; %//example text
rep_orig = {'Frame 1','Frame 2','Frame 22'}; %//strings to be replaced.
%//May be of different lengths

% Computations    
rep_dest = cellfun(@(s) char(zeros(1,length(s))), rep_orig, 'uni', false);
%//series of char(0) of same length as strings to be replaced (to be used as mask)
aux = cell2mat(strrep(text,rep_orig.',rep_dest.'));
ind_keep = all(double(aux),1); %//keep characters according to mask (dim 1 so a single search string also works)
ind_semicolon = diff(ind_keep)==1; %//where to insert ';' 
ind_keep = ind_keep | [ind_semicolon 0]; %// semicolons will also be kept
result = aux(1,:); %//for now
result(ind_semicolon) = ';'; %//include `;`
result = result(ind_keep); %//remove unwanted characters

With these example data:

>> text

text =

Hello Frame 1 test string Frame 22 end of Frame 2 this

>> result

result =

Hello ; test string ; end of ; this

Upvotes: 1
