Reputation: 1001
I have a very long string, and I'd like to find all the NaN values and replace them with 'null'. This long string was converted from a 120 x 150000 cell array. The reason for converting it into a long string is that I was going to turn it into one giant SQL query, since fastinsert and datainsert can be very slow and sometimes I run out of heap space. The idea is to do the following:
exec(sqlConnection, long_string)
I tried using regexprep to replace NaN with null, but it seems very slow. Is there an alternative way?
long_string = regexprep(long_string,'NaN','null');
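For context, here is a minimal sketch of how I am assembling the query string from the cell array; the names data and mytable are placeholders, not my actual code, and quoting of text values is omitted:
% Build one giant INSERT from a cell array "data" into a
% hypothetical table "mytable"; both names are placeholders.
rowStrs = cell(size(data, 1), 1);
for r = 1:size(data, 1)
    vals = cellfun(@num2str, data(r, :), 'UniformOutput', false);
    vals = strrep(vals, 'NaN', 'null');            % NaN -> null, row by row
    rowStrs{r} = ['(' strjoin(vals, ',') ')'];
end
long_string = ['INSERT INTO mytable VALUES ' strjoin(rowStrs, ',') ';'];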
Upvotes: 1
Views: 466
Reputation: 1860
As Floris mentioned, regexp is a very powerful command and, as a result, is slower than simpler find commands. In addition to Floris's suggestion, you can try strrep, which works in your case, since you are not using any of the special powers of regexp.
Here is an example:
str = char('A' + rand(1, 120 * 15000) * ('z' - 'A'));   % random test string of characters between 'A' and 'z'
tic
str2 = strrep(str, 'g', 'null');
disp('strrep: '), toc
tic
str3 = regexprep(str, 'g','null');
disp('regexprep: '), toc
On my computer it returns:
strrep:
Elapsed time is 0.004640 seconds.
regexprep:
Elapsed time is 4.004671 seconds.
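Applied to your case, the drop-in replacement is the same one-liner:
long_string = strrep(long_string, 'NaN', 'null');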
Upvotes: 5
Reputation: 46375
regex is very powerful, but can be slow because of its flexibility. If you still have the original cell array - and assuming it contains only strings - the following line of code should work, and very fast:
cellArray(find(ismember(cellArray, 'NaN'))) = {'null'};
ismember finds all the elements in cellArray that are equal to 'NaN', returning a boolean array with the same shape as cellArray; the find operation turns these into linear indices of the matching elements, and the assignment then puts the value 'null' into all of them at once. Note the parentheses and the scalar cell on the right-hand side, which MATLAB expands to every indexed element; curly-brace indexing would fail here when there is more than one match.
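As a small self-contained demonstration of the same idea (the toy cell array here is made up):
c = {'1.5', 'NaN', '2.0'; 'NaN', '3.1', 'NaN'};
c(find(ismember(c, 'NaN'))) = {'null'};   % scalar cell expands to every matching index
disp(c)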
It must be said that 120 x 150,000 is a VERY large cell array - it will occupy over 2 GB even with just a single character in each cell (I know this because I just created a 120 x 15,000 cell array, and it was 205,500,000 bytes). It might be worth processing this in smaller chunks rather than all at once, especially if you know that the NaN values can occur only in some of the columns.
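A minimal sketch of such chunked processing, assuming the original array is still called cellArray and with an arbitrary chunk size:
chunkSize = 10000;                                 % columns per chunk; tune to available memory
nCols = size(cellArray, 2);
for c0 = 1:chunkSize:nCols
    cols = c0:min(c0 + chunkSize - 1, nCols);
    chunk = cellArray(:, cols);
    chunk(ismember(chunk, 'NaN')) = {'null'};      % replace within this chunk only
    cellArray(:, cols) = chunk;
end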
Processing a GB-sized string, especially when you can't operate in place (you are changing the size of the string with every replacement, and it is getting longer, not shorter), is going to be dead slow. If you really have no other option, you could write a short mex function to do this - that could be pretty fast.
Upvotes: 2