Reputation: 1214
I have a text file containing a huge data set (around 9 GB). The file is arranged as 244 X 3089987, with the values delimited by tabs. I would like to load this text file into MATLAB as a matrix. Here is what I have tried, without success (MATLAB hangs).
fread = fopen('merge.txt','r');
formatString = repmat('%f',244,3089987);
C = textscan(fread,formatString);
Am I doing something wrong or is my approach wrong? If this is easily possible in Python, could someone please suggest accordingly.
Upvotes: 0
Views: 645
Reputation: 25140
Another option in recent MATLAB releases is to use datastore. It is designed to let you page through the data rather than read the whole lot at once, and it can generally deduce the formatting by itself.
Upvotes: 1
Reputation: 573
I'm surprised this even tries to run; when I try something similar, textscan throws an error.
If you really want to use textscan, you only need the format for a single row, so you can replace 244 in your code with 1 and it should work. Edit: having read your comment, note that the first number (244) is the number of columns, so you should use formatString = repmat('%f',1, 244); instead. Also, you can apparently just leave the format empty ('') and it will work.
However, MATLAB has several text import functions, and textscan is rarely the easiest of them to use.
In this case I would probably use dlmread, which reads any delimited numeric data. You want something like:
C=dlmread('merge.txt', '\t');
Also, since you are trying to load 9 GB of data, I assume you have enough memory. If you don't, you'll probably get an out-of-memory error, so it is something to consider.
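As a rough check on the memory point, and since the question mentions Python as an option, here is a small sketch of the arithmetic. It assumes every value is stored as an 8-byte double (MATLAB's default numeric type) and ignores any temporary copies made while parsing, so the real peak usage may well be higher.

```python
# Rough memory estimate for holding the whole matrix in RAM.
# Dimensions are taken from the question; 8 bytes per value is an
# assumption (double precision), not something measured.
n_long = 3089987        # the long dimension from the question
n_short = 244           # the short dimension from the question
bytes_per_double = 8

total_bytes = n_long * n_short * bytes_per_double
total_gb = total_bytes / 1e9      # decimal gigabytes
total_gib = total_bytes / 2**30   # binary gibibytes

print(f"{total_bytes} bytes, about {total_gb:.2f} GB ({total_gib:.2f} GiB)")
```

That works out to a little over 6 GB for the data alone, so a machine with 8 GB of RAM would be a tight fit.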
Upvotes: 0
Reputation: 12214
If you read the documentation for textscan you will see that you can define an input argument N so that:
textscan reads file data using the formatSpec N times, where N is a positive integer. To read additional data from the file after N cycles, call textscan again using the original fileID. If you resume a text scan of a file by calling textscan with the same file identifier (fileID), then textscan automatically resumes reading at the point where it terminated the last read.
You can also pass a blank formatSpec to textscan in order to read in an arbitrary number of columns. This is how dlmread, a wrapper for textscan, operates.
For example:
fID = fopen('test.txt');
chunksize = 10; % Number of lines to read for each iteration
while ~feof(fID) % Iterate until we reach the end of the file
    datachunk = textscan(fID, '', chunksize, 'Delimiter', '\t', 'CollectOutput', true);
    datachunk = datachunk{1}; % Pull data out of cell array. Can take time for large arrays
    % Do calculations
end
fclose(fID);
This will read the file in 10-line chunks until it reaches the end of the file.
If you have enough RAM to store the data (a 244 x 3089987 array of double is just over 6 gigs) then you can do:
fID = fopen('test.txt'); % Reopen the file, since it was closed above
mydata = textscan(fID, '', 'Delimiter', '\t', 'CollectOutput', true);
mydata = mydata{1}; % Pull data out of cell array. Can take time for large arrays
fclose(fID);
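Since the question also asks whether this is feasible in Python, the same chunked-reading idea can be sketched with only the standard library. This is an illustration rather than a tuned solution: read_in_chunks and its parameters are hypothetical names, and it assumes every non-empty line contains only tab-separated numbers with no trailing delimiter.

```python
import csv
import os
import tempfile
from itertools import islice


def read_in_chunks(path, chunk_size=10, delimiter="\t"):
    """Yield lists of up to chunk_size rows; each row is a list of floats."""
    with open(path, newline="") as f:
        # Skip blank lines so they do not end the iteration early.
        rows = (row for row in csv.reader(f, delimiter=delimiter) if row)
        while True:
            chunk = [[float(x) for x in row] for row in islice(rows, chunk_size)]
            if not chunk:
                break  # Nothing left to read
            yield chunk


if __name__ == "__main__":
    # Tiny demonstration file standing in for the real 9 GB file.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("1\t2\t3\n4\t5\t6\n7\t8\t9\n")
    for chunk in read_in_chunks(tmp.name, chunk_size=2):
        print(len(chunk), "rows in this chunk")  # Do calculations here
    os.remove(tmp.name)
```

Only one chunk is held in memory at a time, so the chunk size can be raised well above 10 rows as long as each chunk still fits comfortably in RAM.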
Upvotes: 3
Reputation: 119
Try:
A = importdata('merge.txt', '\t');
http://es.mathworks.com/help/matlab/ref/importdata.html
If the rows are not delimited by '\n', so that the data comes back as a single vector, reshape it into 244 columns:
[C, remaining] = vec2mat(A, 244)
http://es.mathworks.com/help/comm/ref/vec2mat.html
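For readers following the Python route the question asks about, the reshaping step can be sketched as below. vec_to_rows is a hypothetical helper, not a library function; it mirrors my understanding of vec2mat (fill row by row, pad the last row with zeros, and report how many padding values were added), so check that against the linked documentation before relying on it.

```python
def vec_to_rows(vec, ncols, pad=0.0):
    """Split a flat sequence into rows of ncols values.

    Returns (rows, n_padded), where n_padded is the number of pad values
    appended to complete the last row.
    """
    if ncols <= 0:
        raise ValueError("ncols must be a positive integer")
    n_padded = (-len(vec)) % ncols            # values needed to fill the last row
    padded = list(vec) + [pad] * n_padded
    rows = [padded[i:i + ncols] for i in range(0, len(padded), ncols)]
    return rows, n_padded


# Small illustration with 3 columns instead of 244.
rows, n_padded = vec_to_rows([1, 2, 3, 4, 5], 3)
print(rows, n_padded)
```

A non-zero pad count is a useful warning sign here: it means the number of values read is not a multiple of 244, so the file probably does not have the layout you expect.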
Upvotes: 1