Old_Mortality

Reputation: 519

Why does my MATLAB program use so much memory?

I am writing a MATLAB program that reads about 500 files. Each file has about 20,000 lines, with one number per line. The program builds a 20,000 * 500 matrix from these numbers. The numbers are stored as doubles, i.e. 8 bytes each, so I would expect the matrix to take 20,000 * 500 * 8 bytes = 8E7 bytes, i.e. roughly 80 MB. And yet the program exhausts my 16 GB of memory. As it runs, I see memory use climbing steadily, GB by GB. I am using MATLAB R2015b on Ubuntu 14.04.

What's happening? Many thanks for your attention.

Here is the full code:

clear all;
% number of rna bits in the file
filesize = 20532

maxFiles = 480;
rnaCounts = NaN(filesize,maxFiles);

myFolder = '~/_STATS/data3/RNASeqV2/UNC__IlluminaHiSeq_RNASeqV2/Level_3';
filePattern = fullfile(myFolder, '*genes.normalized_results');

theFiles = dir(filePattern);

rnaCounts = NaN(filesize,length(theFiles));


for k = 1 : length(theFiles) 
    mrnaFilename = strtrim(theFiles(k).name);
    fprintf(1, 'Now reading mrnaFile %d %s  \n', k, mrnaFilename);

    % read rna file
    fullFileName = fullfile(myFolder, mrnaFilename);
    rnafid = fopen(fullFileName);

    if rnafid < 0 
       fprintf('====ERROR OPENING RNA FILE =====================');
    end
    rnaline = fgets(rnafid);

    lc = 1;  % line counter
    while ischar(rnaline) && feof(rnafid) ~= 1
       rnaline = fgets(rnafid);
       rnaSplit = strsplit(rnaline);

       % write to the matrix
       rnaCounts(lc,k) = str2num(rnaSplit{2});

       lc = lc + 1;
    end
    fclose(rnafid);

end

Upvotes: 6

Views: 483

Answers (2)

hbaderts

Reputation: 14316

High-level I/O functions such as dlmread or textscan are often a better fit for reading text formats like this. Use dlmread if you have only numeric data, and textscan for more complex formats.

The sample data you provided is:

A2LD1|87769 135.5735

Since you only need the number in the second column and discard the identifier in the first column, the data you actually read is purely numeric, and you can use dlmread:

data = dlmread(fullFileName, '\t', 1, 1);

The '\t' specifies that the delimiter (column separator) is a tab. The two 1s are the row offset and the column offset (both zero-based), i.e. they skip the first row (the header) and the first column (the identifier) of the file.
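For the textscan route mentioned above, here is a minimal sketch. It assumes one header line and tab-separated columns; the sample file, its header names, and the second data row are made up purely so the snippet runs on its own. In the question's loop you would pass fullFileName to fopen instead of sampleFile.

```matlab
% Build a tiny sample file so the sketch is self-contained.
% (Hypothetical data: header names and the second row are assumptions.)
sampleFile = [tempname '.txt'];
fid = fopen(sampleFile, 'w');
fprintf(fid, 'gene_id\tnormalized_count\n');
fprintf(fid, 'A2LD1|87769\t135.5735\n');
fprintf(fid, 'A2M|2\t5000.25\n');
fclose(fid);

% Read it back: skip the header line, skip the identifier column (%*s),
% and parse the second column as double (%f).
fid = fopen(sampleFile, 'r');
if fid < 0
    error('Could not open %s', sampleFile);
end
cols = textscan(fid, '%*s %f', 'Delimiter', '\t', 'HeaderLines', 1);
fclose(fid);

counts = cols{1};    % column vector of doubles, one entry per data line
delete(sampleFile);  % clean up the temporary file
```

Inside the loop, the result could then be stored with something like rnaCounts(1:numel(counts), k) = counts, which replaces the line-by-line parsing entirely.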

Upvotes: 1

drhagen

Reputation: 9532

As the OP verified, the str2num function in the Linux version of MATLAB R2015b has a memory leak. This function is rarely the right tool anyway: it is designed to parse strings representing entire matrices (such as 1 2; 3 4) rather than the typical use case of parsing a single number (such as 1.234). Use str2double for simple number parsing; it is faster even when str2num isn't broken.
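As a rough sketch of the swap, here is the parsing step from the question's loop with str2double in place of str2num. The sample line is built with sprintf only so the snippet runs on its own; in the real loop rnaline comes from fgets.

```matlab
% Sample line as fgets would return it (tab-separated, trailing newline).
rnaline = sprintf('A2LD1|87769\t135.5735\n');

% Trim the trailing newline, then split on whitespace as in the question.
rnaSplit = strsplit(strtrim(rnaline));

% str2double parses a single number and returns NaN if the text is not numeric.
value = str2double(rnaSplit{2});

% A non-numeric field yields NaN rather than an empty result.
badValue = str2double('not_a_number');
```

In the original loop this amounts to changing one line: rnaCounts(lc,k) = str2double(rnaSplit{2});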

Using a different version of MATLAB would likely also work around the problem because, in my experience, memory bugs of this kind don't usually persist from one version to the next.

Upvotes: 3
