VSB

Reputation: 10375

Read CSV file in MATLAB with rows of different sizes

I need to read a large CSV file in MATLAB that contains rows like this:

1, 0, 1, 0, 1
1, 0, 1, 0, 1, 0, 1, 0, 1
1, 0, 1
1, 0, 1
1, 0, 1, 0, 1
0, 1, 0, 1, 0, 1, 0, 1, 0

To read large files I'm using textscan; however, it requires me to define the number of expected fields in each line of the text file.

Using csvread works, but it is too slow and seems inefficient. Is there a way to use textscan with an unknown number of values on each line? Or do you have any other suggestion for this situation?

Upvotes: 1

Views: 649

Answers (1)

Hoki

Reputation: 11792

Since you said "Numerical matrix padded with zeros would be good", there is a solution using textscan that can give you that. The catch, however, is that you have to know the maximum number of elements a line can have (i.e. the longest line in your file).

Provided you know that, a combination of additional textscan parameters allows you to read an incomplete line:

If you set the parameter 'EndOfLine','\r\n', the documentation explains:

If there are missing values and an end-of-line sequence at the end of the last line in a file, then textscan returns empty values for those fields. This ensures that individual cells in output cell array, C, are the same size.

So with the example data in your question saved as differentRows.txt, the following code:

% be sure about this, better to overestimate than underestimate
maxNumberOfElementPerLine = 10 ;

% build a reading format which can accommodate the longest line
readFormat = repmat('%f',1,maxNumberOfElementPerLine) ;

fidcsv = fopen('differentRows.txt','r') ;

M = textscan( fidcsv , readFormat , Inf ,...
    'delimiter',',',...
    'EndOfLine','\r\n',...
    'CollectOutput',true) ;

fclose(fidcsv) ;
M = cell2mat(M) ; % convert to numerical matrix

will return:

>> M
M =
     1     0     1     0     1   NaN   NaN   NaN   NaN   NaN
     1     0     1     0     1     0     1     0     1   NaN
     1     0     1   NaN   NaN   NaN   NaN   NaN   NaN   NaN
     1     0     1   NaN   NaN   NaN   NaN   NaN   NaN   NaN
     1     0     1     0     1   NaN   NaN   NaN   NaN   NaN
     0     1     0     1     0     1     0     1     0   NaN
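
If the longest line is not known in advance, one possible way to find it is a preliminary pass that counts the delimiters on each line. The following is only a sketch under assumptions: it writes three of the sample rows to a temporary demo file (the name differentRows_demo.txt is made up for illustration), it assumes fields never contain quoted commas, and it costs one extra read of the file, which may matter for very large inputs.

```matlab
% Hypothetical demo file holding a few rows from the question
demoFile = fullfile(tempdir,'differentRows_demo.txt') ;
fid = fopen(demoFile,'w') ;
fprintf(fid,'%s\r\n', '1, 0, 1, 0, 1', '1, 0, 1, 0, 1, 0, 1, 0, 1', '1, 0, 1') ;
fclose(fid) ;

% Preliminary pass: a line with k commas holds k+1 fields
maxNumberOfElementPerLine = 0 ;
fid = fopen(demoFile,'r') ;
tline = fgetl(fid) ;
while ischar(tline)
    nFields = sum(tline == ',') + 1 ;
    maxNumberOfElementPerLine = max(maxNumberOfElementPerLine, nFields) ;
    tline = fgetl(fid) ;
end
fclose(fid) ;

% Build the reading format from the measured maximum
readFormat = repmat('%f',1,maxNumberOfElementPerLine) ;
```

With the real file, only the second half is needed: point fopen at your own file name and pass the resulting readFormat to textscan as in the code above.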

As an alternative, if it makes a significant speed difference, you could import your data as integers instead of doubles. The trouble is that NaN is not defined for integers, so you have two options:

  • 1) Leave the empty entries at the default 0

Just replace the line that defines the format specifier with:

% build a reading format which can accommodate the longest line
readFormat = repmat('%d',1,maxNumberOfElementPerLine) ;

This will return:

>> M
M =
1   0   1   0   1   0   0   0   0   0
1   0   1   0   1   0   1   0   1   0
1   0   1   0   0   0   0   0   0   0
1   0   1   0   0   0   0   0   0   0
1   0   1   0   1   0   0   0   0   0
0   1   0   1   0   1   0   1   0   0

  • 2) Replace the empty entries with a placeholder (e.g. 99)

Define a value which you are sure you'll never have in your original data (for quick identification of empty cells), then use the EmptyValue parameter of the textscan function:

readFormat = repmat('%d',1,maxNumberOfElementPerLine) ;
DefaultEmptyValue = 99 ; % placeholder for "empty values"

fidcsv = fopen('differentRows.txt','r') ;
M = textscan( fidcsv , readFormat , Inf ,...
    'delimiter',',',...
    'EndOfLine','\r\n',...
    'CollectOutput',true,...
    'EmptyValue',DefaultEmptyValue) ;

fclose(fidcsv) ;
M = cell2mat(M) ; % convert to numerical matrix

will yield:

>> M
M =
1   0   1   0   1   99  99  99  99  99
1   0   1   0   1   0   1   0   1   99
1   0   1   99  99  99  99  99  99  99
1   0   1   99  99  99  99  99  99  99
1   0   1   0   1   99  99  99  99  99
0   1   0   1   0   1   0   1   0   99
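
Once the padded matrix is available, the placeholder makes it possible to recover how many real values each row held. A minimal sketch, assuming M is the int32 matrix shown above and that 99 never occurs in the real data; the variable names isData, rowLengths and rows are illustrative only:

```matlab
DefaultEmptyValue = 99 ; % same placeholder as used for 'EmptyValue'

% Matrix as returned by the placeholder variant above
M = int32([ 1 0 1 0 1 99 99 99 99 99 ; ...
            1 0 1 0 1 0  1  0  1  99 ; ...
            1 0 1 99 99 99 99 99 99 99 ; ...
            1 0 1 99 99 99 99 99 99 99 ; ...
            1 0 1 0 1 99 99 99 99 99 ; ...
            0 1 0 1 0 1  0  1  0  99 ]) ;

% Logical mask: true where the entry came from the file
isData = (M ~= DefaultEmptyValue) ;

% Number of real values on each row
rowLengths = sum(isData,2) ;

% Optional: rebuild the ragged rows as a cell array
rows = cell(size(M,1),1) ;
for r = 1:size(M,1)
    rows{r} = M(r, isData(r,:)) ;
end
```

For the NaN-padded double matrix, the same idea should work with ~isnan(M) as the mask.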

Upvotes: 2
