Reading Irregular Text Files with MATLAB

Question

In short, I'm struggling, in more than one language, to read a txt file (linked below). MATLAB is the language I know best, so I'm using it for this example. I've found a way to read this file in about 5 minutes, but my instrument measures every 30 seconds, all day, so I'll soon have a huge amount of data and this just isn't feasible.

I'm looking for a way to quickly read these irregular text files so that going forward I can knock these out with less of a time burden.

You can find my exact data at this link:

http://lb3.pandonia.net/BostonMA/Pandora107s1/L0/Pandora107s1_BostonMA_20190814_L0.txt.bz2

I've been using the "readtable" function in MATLAB and I get the final product I want, but I'm looking to increase the speed.

Below is my code!

clearvars -except pan day1; % Clearing all variables except for the day and instrument variables.
close all;
clc;
pan_mat = [107 139 155 153]; % Matrix of pandora numbers for file-choosing reasons.

pan = pan_mat(pan); % pandora number I'm choosing
pan = num2str(pan); % Turning Pandora number into a string.
%pan = '107'
pandora = strcat('C:\Users\tadams15\Desktop\Folders\Counts\Pandora_Dta\',pan) 
% string that designates file location



%date = '90919'

month = '09'; % Month
day2 = strcat('0',num2str(day1)) % Creating a day name for the figure I ultimately produce

cd(pandora)
d2 = strcat('2019',num2str(month),num2str(day2)); % The final date variable for the figure I produce
%file_pan = 'Pandora107s1_BostonMA_20190909_L0';
file_pan = strcat('Pandora',pan,'s1_BostonMA_',d2,'_L0'); % File name string

%Try reading it in line by line?
% Load in as a string and then convert the lines you want as numbers into
% number. 
delimiterIn = '\t';
headerlinesIn = 41;
A = readtable(file_pan,'HeaderLines', 41, 'Delimiter', '\t'); % Reading the file as a table
A = table2cell(A); % Converting file to a cell
A = regexp(A, ' ', 'split'); % converting cell to a structure matrix.

%%
A= array2table(A); % Converting Structure matrix back to table
row_num = 0;
pan_mat_2 = zeros(2359,4126);
datetime_mat = zeros(2359,2);
blank = 0;

%% Converting data to proper matrices
[length width] = size(A);

% The matrix below is going through "A" and writing from it to a new
% matrix, "pan_mat_2" which is my final product as well as singling out the
% rows that contain non-number variables I'd like to keep and adding them
% later.
tic
%flag1
for i = 1:length; % Make second number the length of the table, A
    blank = 0;
    b = table2array(A{i,1});
    [rows, columns] = size(b);
    if columns > 4120 && columns < 4140
       row_num = row_num + 1;
       blank = regexp(b(2), 'T', 'split');
       blank2 = regexp(blank{1,1}(2), 'Z', 'split');
       datetime_mat(row_num,1) = str2double(blank{1,1}(1));
       datetime_mat(row_num,2) = str2double(blank2{1,1}(1));
        for j = 1:4126;
            pan_mat_2(row_num,j) = str2double(b(j));
        end
    end
end
toc
%flag2

In short, I'm already getting the result I want, but the part of the code where I'm writing to a new array (between the %flag1 and %flag2 markers) takes roughly 222 seconds, while the entire code only takes about 248 seconds. I'd like to find a better way to create the data there than writing it to a new array element by element, which takes a whole bunch of time.

Any suggestions?

Hoki · Accepted Answer

Note:

There are quite a few improvements you can make for speed, but there are also corrections to make. You preallocate your final output variable with hard-coded values:

pan_mat_2 = zeros(2359,4126);

But later you populate it in a loop which runs for i = 1:length.

length is the full number of lines picked from the file. In your example file there are only 784 lines. So even if all your lines were valid (OK to be parsed), you would only ever fill the first 784 rows of the 2359 rows you allocated in pan_mat_2. In practice, this file has only 400 valid data lines, so pan_mat_2 could definitely be smaller.

I know you couldn't know that only 400 lines would be parsed before you parsed them, but you knew from the beginning that you had only 784 lines to parse (you had that information in the variable length). So in cases like this, pre-allocate to 784 and discard the empty rows afterwards.
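As a minimal sketch of that "pre-allocate to the known upper bound, then trim" pattern, here is a toy version. The `lines` cell array, the 3-value row width and the validity test are made up for illustration; the real file has roughly 4126 tokens per valid line.

```matlab
% Toy stand-in for the parsed file: each cell holds one line's tokens.
% (Hypothetical data, not taken from the real Pandora file.)
lines = { {'a','1','2','3'}; {'header'}; {'b','4','5','6'}; {'x','y'} };

nLines  = numel(lines);             % known upper bound before parsing starts
nValues = 3;                        % numeric tokens per valid line (toy value)
out     = zeros(nLines, nValues);   % pre-allocate to the upper bound
row_num = 0;

for i = 1:nLines
    tokens = lines{i};
    if numel(tokens) == nValues + 1          % keep only well-formed lines
        row_num = row_num + 1;
        out(row_num, :) = str2double(tokens(2:end));
    end
end

out(row_num+1:end, :) = [];         % discard the rows that were never filled
```

The output never has more rows than the input has lines, so nothing is wasted beyond the skipped lines, and the final trim is a single indexing operation.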

Fortunately, the solution I propose does not need to over-allocate and then discard. The matrices end up the right size from the start.

The code:

%%
file_pan = 'Pandora107s1_BostonMA_20190814_L0.txt' ;
delimiterIn = '\t';
headerlinesIn = 41;
A = readtable(file_pan,'HeaderLines', 41, 'Delimiter', '\t'); % Reading the file as a table
A = table2cell(A); % Converting file to a cell
A = regexp(A, ' ', 'split'); % splitting each line on spaces (gives a cell array of cell arrays)

%% Remove lines which won't be parsed
% Count the number of elements in each line
nelem = cell2mat( cellfun( @size , A  ,'UniformOutput',0) ) ;
nelem(:,1) = [] ;
% find which lines do not have the right number of elements to be parsed
idxLine2Remove = ~(nelem > 4120 & nelem < 4140) ;
% remove them from the data set
A(idxLine2Remove) = [] ;

%% Remove nesting in cell array
nLinesToParse = size(A,1) ;
A = reshape( [A{:}] , [], nLinesToParse ).' ;
% now you have a cell array of size [400x4126] cells

%% Now separate the columns with different data type
% Column 1          => [String] identifier
% Column 2          => Timestamp
% Column 3 to 4125  => Numeric values
% Column 4126       => empty cell created during the 'split' operation above
%                      because of a trailing space character.
LineIDs    = A(:,1) ;
TimeStamps = A(:,2) ;
Data       = A(:,3:end-1)  ; % fetch to "end-1" to discard last empty column

%% now extract the values

% You could do that directly:
%     pan_mat = str2double(Data) ;

% but this takes a long time. A computationally much faster way (even if it
% uses more complex code) would be:
dat = strjoin(Data) ;                                       % create a single long string made of all the strings in all the cells
nums = textscan( dat , '%f' , Inf ) ;                       % call textscan on it (way faster than str2double() )
pan_mat = reshape( cell2mat( nums ) , nLinesToParse ,[] ) ; % reshape to original dimensions

%% timestamps
% convert to character array
strTimeStamps = char(TimeStamps) ;
% convert to MATLAB's own serial date numbers. This will be a lot faster if
% you have operations to do on the time stamps later
ts = datenum(strTimeStamps,'yyyymmddTHHMMSSZ') ;


%% If you really want them the way you had it in your example
strTimeStamps(:,9)   = ' ' ; % replace 'T' with ' '
strTimeStamps(:,end) = ' ' ; % replace 'Z' characters with ' '
%then same again, merge into a long string, parse then reshape accordingly
strdate = reshape(strTimeStamps.',1,[]) ;
tmp = textscan( strdate , '%d' , Inf ) ;
datetime_mat = reshape( double(cell2mat(tmp)),2,[]).' ;
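To make the filtering and un-nesting steps easier to follow, here is a small sketch of the same idea on toy data. The three `raw` lines and the token count of 6 are invented for illustration; the real file has about 4126 tokens per valid line.

```matlab
% Toy input: one char vector per line, tokens separated by single spaces,
% with a trailing space as in the real file (hypothetical data).
raw = { 'L1 20190814T000005Z 1 2 3 ';
        'comment';
        'L2 20190814T000035Z 4 5 6 ' };

A = regexp(raw, ' ', 'split');     % nested cell: one row of tokens per line
nelem = cellfun(@numel, A);        % token count per line
A(nelem ~= 6) = [];                % logical mask drops the short line

n = size(A, 1);
A = reshape([A{:}], [], n).';      % un-nest into an n-by-6 cell array

TimeStamps = A(:, 2);
Data       = A(:, 3:end-1);        % drop the empty last column
```

Note that the reshape only works when every kept line has exactly the same number of tokens. If the accepted range (4121 to 4139 in the code above) ever let through lines of different lengths, the reshape would fail or misalign the columns.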

The performance:

On my machine, your original code takes ~102 seconds to execute, with 80% of that (81s) spent on calling the function str2double() 3,302,400 times!

My solution, run on the same input file, takes ~5.5 seconds, with half of the time spent on calling strjoin() 3 times.

When you read the code above, try to understand how I limited repeated function calls in lengthy loops by keeping everything as vectorised as possible.
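If you want to check the effect on your own machine, the following sketch compares the two approaches on synthetic data. The 200-by-500 size and the random integers are arbitrary assumptions, not taken from the Pandora file, and the timings will differ from the figures quoted above.

```matlab
% Synthetic data: a 200-by-500 cell array of numeric strings
% (sizes are arbitrary, chosen only to make the difference visible).
vals = round(rand(200, 500) * 1e4);
Data = arrayfun(@(v) sprintf('%d', v), vals, 'UniformOutput', false);

% Approach 1: one str2double call per element, as in a nested loop.
tic
slow = zeros(size(Data));
for k = 1:numel(Data)
    slow(k) = str2double(Data{k});
end
tLoop = toc;

% Approach 2: join everything into one string and parse it in a single call.
tic
joined = strjoin(Data(:).');              % row of tokens, column-major order
nums   = textscan(joined, '%f', Inf);     % one parse for all values
fast   = reshape(nums{1}, size(Data, 1), []);
tVec = toc;

fprintf('loop: %.3f s, vectorised: %.3f s\n', tLoop, tVec);
assert(isequal(slow, fast));              % both approaches give the same matrix
```

The reshape recovers the original layout because the strings are joined in column-major order, which is the same order reshape uses to fill the matrix.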
