Reputation: 519
I am writing a MATLAB program that reads about 500 files. Each file has 20,000 lines, with one number on each line. The program builds a 20,000 x 500 matrix from these numbers. The numbers are stored as doubles, so 8 bytes per number. I would therefore expect the matrix to take 20,000 * 500 * 8 bytes = 8E7 bytes, i.e. on the order of 100 MB. And yet this program exhausts my 16 GB of memory. As the program runs, I see the memory use going up steadily, GB by GB. I am using MATLAB R2015b on Ubuntu 14.04.
What's happening? Many thanks for your attention.
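The size estimate above can be checked with a few lines of MATLAB. This is only a sketch using the round numbers from the description (20,000 rows, 500 files); the variable names are made up for illustration.

```matlab
% Rough size estimate for a 20,000 x 500 matrix of doubles.
nRows = 20000;          % lines per file (round number from the description)
nFiles = 500;           % number of files (round number from the description)
bytesPerDouble = 8;     % a MATLAB double occupies 8 bytes

expectedBytes = nRows * nFiles * bytesPerDouble;   % 8e7 bytes
expectedMB = expectedBytes / 1e6;                  % 80 MB

% Cross-check against what MATLAB reports for a preallocated matrix.
A = NaN(nRows, nFiles);
info = whos('A');
fprintf('Expected %d bytes, whos reports %d bytes (%.0f MB)\n', ...
    expectedBytes, info.bytes, expectedMB);
```

So the matrix itself should account for well under 100 MB; anything in the GB range has to come from somewhere else.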
Here is the full code
clear all;
% number of rna bits in the file
filesize = 20532
maxFiles = 480;
rnaCounts = NaN(filesize,maxFiles);
myFolder = '~/_STATS/data3/RNASeqV2/UNC__IlluminaHiSeq_RNASeqV2/Level_3';
filePattern = fullfile(myFolder, '*genes.normalized_results');
theFiles = dir(filePattern);
rnaCounts = NaN(filesize,length(theFiles));
for k = 1 : length(theFiles)
    mrnaFilename = strtrim(theFiles(k).name);
    fprintf(1, 'Now reading mrnaFile %d %s \n', k, mrnaFilename);
    % read rna file
    fullFileName = fullfile(myFolder, mrnaFilename);
    rnafid = fopen(fullFileName);
    if rnafid < 0
        fprintf('====ERROR OPENING RNA FILE =====================');
    end
    rnaline = fgets(rnafid);
    lc = 1; % line counter
    while ischar(rnaline) && feof(rnafid) ~= 1
        rnaline = fgets(rnafid);
        rnaSplit = strsplit(rnaline);
        % write to the matrix
        rnaCounts(lc,k) = str2num(rnaSplit{2});
        lc = lc + 1;
    end
    fclose(rnafid);
end
Upvotes: 6
Views: 483
Reputation: 14316
Often, high-level I/O functions such as dlmread or textscan are useful for reading such text formats. Use dlmread if you have only numeric data, and textscan for more complex formats.
The sample data you provided is:
A2LD1|87769 135.5735
As you only need the number in the second column and can discard the identifier in the first column, all you have is numeric data, and you can use dlmread.
data = dlmread(fullFileName, '\t', 1, 1);
The \t specifies that the delimiter (column separator) is a tab. The two 1s specify a row offset and a column offset, i.e. ignore the first row (the header) and the first column (the ID) of the file.
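For the "more complex formats" case, a textscan version might look like the sketch below. It assumes the layout implied by the sample line: one header row, then an identifier and a number separated by a tab. The header names and the second data row are made up so that the example can run on a temporary file.

```matlab
% Write a tiny sample file: header line, then id<TAB>value on each row.
% (Header text and the second row are hypothetical, for illustration only.)
tmpFile = [tempname '.txt'];
fid = fopen(tmpFile, 'w');
fprintf(fid, 'gene_id\tnormalized_count\n');
fprintf(fid, 'A2LD1|87769\t135.5735\n');
fprintf(fid, 'A1BG|1\t42.5\n');
fclose(fid);

% %*s skips the identifier column, %f reads the number.
fid = fopen(tmpFile, 'r');
cols = textscan(fid, '%*s %f', 'Delimiter', '\t', 'HeaderLines', 1);
fclose(fid);
delete(tmpFile);

values = cols{1};   % column vector with one number per data row
disp(values);
```

In the loop from the question, values could then be assigned to one column of rnaCounts, provided the number of rows matches.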
Upvotes: 1
Reputation: 9532
As verified by the OP, the str2num function in the Linux version of Matlab 2015b has a memory leak. This function is not very useful anyway, as it is designed to parse strings representing entire matrices (1 2; 3 4) rather than the typical use case of parsing a single number (1.234). Use str2double when doing simple number parsing; it is faster even when str2num isn't broken.
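To illustrate the difference, here is a small sketch. The sample line is the one quoted in the other answer, with a tab assumed between the two fields; the variable names follow the question's code.

```matlab
% One line in the assumed file format: identifier, tab, number, newline.
rnaline = sprintf('A2LD1|87769\t135.5735\n');
rnaSplit = strsplit(rnaline);        % splits on whitespace, including the tab

% str2double parses a single number and returns NaN on anything else.
value = str2double(rnaSplit{2});
bad = str2double('not a number');    % NaN, no error

% str2num evaluates the text, which is why it can parse a whole matrix.
m = str2num('1 2; 3 4');             % 2-by-2 matrix

fprintf('value = %.4f, isnan(bad) = %d, size(m) = %dx%d\n', ...
    value, isnan(bad), size(m, 1), size(m, 2));
```

In the question's loop, the change amounts to replacing str2num(rnaSplit{2}) with str2double(rnaSplit{2}).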
It is likely that using a different version of Matlab would also work around the problem, because in my experience, these kinds of memory bugs don't usually persist from one version to the next.
Upvotes: 3