asboans

Reputation: 161

dlmread returning a single column for large text files

I'm trying to read in a large file with dlmread, but it just treats the whole file as one long column. The file is written in Java with the following code:

public void writeToFile(double[] arr) throws IOException {
    FileWriter write = new FileWriter(path, append);
    PrintWriter print_line = new PrintWriter(write);

    for (int i = 0; i < arr.length; i++) {
        print_line.printf("%f\t", arr[i]);
    }
    print_line.printf("\n");

    print_line.close();
}
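Two things worth noting about this writer (an aside, not necessarily the cause of the dlmread behaviour): printf("%f\t", ...) emits a tab after every value, including the last one on each row, and %f in Java is locale-sensitive, so on some systems it writes a comma as the decimal separator, which dlmread cannot parse as a number. A sketch of a variant that puts tabs only between values and pins the locale (the class and method names here are illustrative, not from the question):

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Locale;
import java.util.StringJoiner;

public class RowWriter {
    // Illustrative variant of writeToFile: tabs only *between* values,
    // and an explicit locale so %f always uses '.' as the decimal separator.
    public static void writeRow(String path, boolean append, double[] arr)
            throws IOException {
        try (PrintWriter out = new PrintWriter(new FileWriter(path, append))) {
            StringJoiner row = new StringJoiner("\t");
            for (double v : arr) {
                row.add(String.format(Locale.US, "%f", v));
            }
            out.print(row.toString());
            out.print("\n"); // match the original writer's bare newline
        }
    }
}
```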

and my MATLAB script reads the file in with [DATA] = dlmread('probability_cyclelength.dat');, giving:

>>size(DATA)

ans =
         2000000        1 

There are 2,000,000 values in total, with up to 60,000 per row (though not the same number in each row, which shouldn't matter).

When I try it with a smaller dataset (100,000 values), it works absolutely fine. I don't know whether the problem is in the Java or the MATLAB, so I really need some help, thanks!

Upvotes: 4

Views: 3387

Answers (1)

slayton

Reputation: 20319

By default, dlmread tries to infer the delimiter from the file, treating whitespace as the delimiter when none is specified.

The only way I was able to replicate the problem you describe was by specifying ' ' as the delimiter. Are you sure you aren't doing this?

Try making this change and see if it fixes your problem.

data = dlmread(inFile, '\t');

If that doesn't fix your problem, then I suspect the problem arises from the rows in your text file having different numbers of columns. For example, if you use dlmread to open a text file containing:

1 2 3 4
5

dlmread returns a matrix like this:

1 2 3 4
5 0 0 0

This representation is wasteful: it uses 64 bytes (8 bytes per double * 8 doubles) to store only 40 bytes of actual information.

It could be that, with these padded positions, a matrix representation of your file is simply too big, and so dlmread is returning a vector instead to save memory.

You can work around this, though. If you only need a few rows at a time, you can load a block of rows from the file by passing a range to dlmread. Note that for this to work you have to know the maximum number of columns in the file, as dlmread won't let you read more columns than that.

r = [0 4];  % load the first 5 rows (the range argument is zero-indexed)
maxC = 9;   % last column index to read, i.e. up to 10 columns
data = dlmread(inFile, '\t', [r(1), 0, r(2), maxC]);

You could then loop through the file loading the rows of interest, but you probably can't load them all into a matrix due to the memory constraints I mentioned earlier.
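That loop might look something like the following sketch (rowsPerChunk, nRows, and maxC are illustrative names and assumed to be known; inFile is the file from the range example above):

```matlab
rowsPerChunk = 1000;   % illustrative chunk size
nRows = 2000;          % total number of rows in the file (assumed known)
maxC = 9;              % last column index present in the file (assumed known)

for r0 = 0:rowsPerChunk:nRows-1
    r1 = min(r0 + rowsPerChunk - 1, nRows - 1);
    % rows and columns in the range argument are zero-indexed
    chunk = dlmread(inFile, '\t', [r0, 0, r1, maxC]);
    % ... process chunk here, then discard it before the next iteration ...
end
```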

If you need the entire dataset in memory then you should consider loading each row individually and saving them into a cell array. It takes a bit more work to get everything loaded but you could do that with something like this:

% open the file
fid = fopen(fileName);
% load each line as a single string
tmp = textscan(fid, '%s', 'delimiter', '\n');
fclose(fid);
% textscan wraps its results in a cell, remove that wrapping
rawText = tmp{1};
nLines = numel(rawText);

% create a cell array to store the processed rows
data = cell(nLines, 1);
for i = 1:nLines
    % scan a line of text, returning a vector of doubles
    tmp = textscan(rawText{i}, '%f');
    data{i} = tmp{1};
end

Upvotes: 6
