Laurengineer

Reputation: 747

MATLAB data parse optimisation

I'm reading a relatively large text file that contains columns of numbers interspersed with other text, though really I just want the columns of numbers. There's also a bunch of other text, not shown here, that doesn't appear at such regular intervals.

The file format:

*** LOTS OF OTHER TEXT AND NUMBERS ***

  iter  continuity  x-velocity  y-velocity           k     epsilon vf-vapour_ph     time/iter
   111  3.4714e-08  5.3037e-10  6.0478e-10  1.6219e-15  1.8439e-13  0.0000e+00  0:00:01   14
   112  3.2652e-08  5.0553e-10  5.6497e-10  1.3961e-15  1.5730e-13  0.0000e+00  0:00:01   13
   113  3.1371e-08  4.6175e-10  5.0506e-10  1.2020e-15  1.3419e-13  0.0000e+00  0:00:01   12
   114  3.0016e-08  4.4331e-10  4.7391e-10  1.0388e-15  1.1447e-13  0.0000e+00  0:00:01   11
   115  2.8702e-08  4.2111e-10  4.4778e-10  8.9904e-16  9.7680e-14  0.0000e+00  0:00:01   10
   116  2.7476e-08  4.1484e-10  4.2711e-10  7.7955e-16  8.3342e-14  0.0000e+00  0:00:01    9
   117  2.6436e-08  3.9556e-10  4.0601e-10  6.7890e-16  7.1113e-14  0.0000e+00  0:00:01    8
   118  2.5374e-08  3.8633e-10  3.8826e-10  5.9234e-16  6.0674e-14  0.0000e+00  0:00:00    7
   119  2.4292e-08  3.7473e-10  3.7584e-10  5.1814e-16  5.1786e-14  0.0000e+00  0:00:00    6
   120  2.3474e-08  3.5952e-10  3.5622e-10  4.5405e-16  4.4207e-14  0.0000e+00  0:00:00    5
   121  2.2612e-08  3.4485e-10  3.4159e-10  3.9910e-16  3.7707e-14  0.0000e+00  0:00:00    4
  iter  continuity  x-velocity  y-velocity           k     epsilon vf-vapour_ph     time/iter
   122  2.1992e-08  3.4100e-10  3.2964e-10  3.5272e-16  3.2204e-14  0.0000e+00  0:00:00    3
   123  2.1592e-08  3.2444e-10  3.0170e-10  3.1487e-16  2.7500e-14  0.0000e+00  0:00:00    2
   124  2.1053e-08  3.3145e-10  2.9325e-10  2.8009e-16  2.3485e-14  0.0000e+00  0:00:00    1
   125  2.0390e-08  3.1502e-10  2.7534e-10  2.5433e-16  2.0053e-14  0.0000e+00  0:00:00    0
  step  flow-time mfr_arm_inne mfr_arm_oute pressure_sta pressure_sta pressure_tot pressure_tot velocity_max velocity_min
     1  5.0000e-07 -5.5662e-08  1.4217e-07  6.0015e+00  5.9998e+00  6.0015e+00  5.9998e+00  2.8934e-04  3.3491e-10
Flow time = 5e-07s, time step = 1
799 more time steps

Updating solution at time levels N and N-1.
 done.


Writing data to output file.
Current time=0.000000  Position=-0.00000036409265555078  Velocity=0.000015  Net force=0.210322
Fluid force=-0.477050N, Stator force=0.200000N ,Spring force=-32.990534N ,Top force=0.000000N, Bottom force=33.007906N, External force=0.470000N

Next time=0.000001  Position=-0.00000036400170391852  Velocity=0.000182
Applying motion to dynamic zone.

*** CONTINUING TEXT AND NUMBERS ***

The lines I want are:

111  3.4714e-08  5.3037e-10  6.0478e-10  1.6219e-15  1.8439e-13  0.0000e+00  0:00:01   14
112  3.2652e-08  5.0553e-10  5.6497e-10  1.3961e-15  1.5730e-13  0.0000e+00  0:00:01   13

The script I have so far works, but takes about 80s to do the whole thing.

This is made more awkward, I presume, by the colons in the time column, which appear in some of my files. Some files will have more or fewer columns containing different types of data, and some will have an additional set at the end of the main chunk, such as:

  step  flow-time mfr_arm_inne mfr_arm_oute pressure_sta pressure_sta pressure_tot pressure_tot velocity_max velocity_min
     1  5.0000e-07 -5.5662e-08  1.4217e-07  6.0015e+00  5.9998e+00  6.0015e+00  5.9998e+00  2.8934e-04  3.3491e-10

I'm not looking to extract this data, but it can have a very similar (sometimes identical) format to the lines I want.

The script essentially reads each line and checks whether the first few characters (based on the length of the iteration number) match the iteration I'm expecting next (1, 2, 3, ..., n). I've done it this way to try to skip the lines under "step...", which I don't want. However, the file is about 180,000 lines long (and it's my shortest), so you can imagine this gets a little slow.

% read the raw data from the file
file = 'file.txt';
fid = fopen(file, 'r');
raw = textscan(fid, '%s', 'Delimiter', '\n');
fid = fclose(fid);
raw = raw{1,1};

% expression used for splitting the columns up
colExpr = '[\d\.e:\-\+]+';

% beginning number
iterNum = 1;

% loop through lines
for line = 1:length(raw)

    % convert to string for comparison
    iterStr = num2str(iterNum);
    thisLine = raw{line, 1};

    % if the right length and the right string,
    if length(iterStr) <= length(thisLine) && ...
            strcmp(thisLine(1:length(iterStr)), iterStr)

        % split the string
        result(iterNum,:) = regexp(thisLine,colExpr, 'match');

        iterNum = iterNum + 1;

    end

end

% convert to matrix
residuals = cellfun(@str2num, result); 

Using the profiler, I can see that the num2str() function is the slowest part (20 s), followed by int2str() (10 s), though I can't see a way of reading the data without it being part of the loop.

Is there something I'm missing that would let me optimise this process?

EDIT:

I've included more of the lines that I don't want, and a possible different format, to help with answers.

Upvotes: 2

Views: 76

Answers (3)

Amro

Reputation: 124563

Here is a different approach: we first process the file externally, with something like:

# only keep lines starting with a digit
$ grep '^\s*[0-9]' file.txt > file2.txt

On Windows, you can use findstr as an equivalent to grep:

C:\> findstr /R /c:"^[ \t]*[0-9]" file.txt > file2.txt
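If you'd rather drive the preprocessing from MATLAB itself, you can launch the external command with system() (a minimal sketch; use whichever of the two filters above matches your platform):

% run the external filter from MATLAB, then load file2.txt as shown below
status = system('grep ''^\s*[0-9]'' file.txt > file2.txt');   % on Windows, substitute the findstr command
if status ~= 0
    error('External filter failed');
end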

Now in MATLAB, it's easy to load the resulting numeric data as a matrix:

>> load -ascii file2.txt
>> t = array2table(file2, 'VariableNames',...
    {'iter','continuity','xvelocity','yvelocity','k','epsilon','vf_vapour_ph'})
t = 
    iter    continuity    xvelocity     yvelocity        k          epsilon      vf_vapour_ph
    ____    __________    __________    _________    __________    __________    ____________
     1             0      6.2376e-07            0     0.0018988        2708.2    0           
     2             0         0.21656      0.23499     0.0097531       0.13395    0           
     3             0         0.11755      0.12824     0.0032109        0.1146    0           
     4             0        0.068112     0.072691    0.00089801      0.062219    0           
     5             0        0.043498     0.045244    0.00020248      0.025923    0           
     6        0.1938        0.029107     0.029029    4.8399e-05     0.0099171    0           
     7       0.13594        0.020037     0.019577    1.5502e-05     0.0043624    0           
     8      0.097518        0.013805     0.013249    5.1736e-06     0.0023341    0           
     9      0.070467       0.0098312    0.0091925    1.8272e-06     0.0012615    0           
    10      0.051538       0.0071181    0.0064673    7.2446e-07     0.0007012    0           
    11      0.038065       0.0052115    0.0046128    4.2786e-07    0.00040619    0           
    12      0.028369       0.0038465    0.0033381    2.8256e-07    0.00025864    0           
    13      0.021326        0.002857    0.0024454    1.9279e-07    0.00016126    0           

Upvotes: 1

Suever

Reputation: 65450

Since you already have the entire file loaded into a cell array (raw), you can call regexp directly on it to remove the bad rows.

%// Find lines that contain your data
matches = regexp(raw, '^\s*\d(.*?\de[+\-]\d){6}');

%// Empty matches (header lines) should be removed
toremove = cellfun(@isempty, matches);
raw = raw(~toremove);

Then you can convert the result into a numeric array using str2num combined with strjoin.

data = reshape(str2num(strjoin(raw)), 7, []).';

The benefit of this approach is that you avoid any sort of looping or repeated function calls, which are notorious for slowing MATLAB down.
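If some of your files have a different number of columns, the same idea can be parameterised. This is only a sketch: it assumes every wanted row contains exactly nCols purely numeric fields (so no colon-separated time column), and nCols is a value you set per file.

% build the pattern for nCols fields: a leading integer plus (nCols-1) exponent-style numbers
nCols = 7;                                              % set per file (assumption)
pattern = ['^\s*\d(.*?\de[+\-]\d){' num2str(nCols-1) '}'];
keep = ~cellfun(@isempty, regexp(raw, pattern, 'once'));
data = reshape(str2num(strjoin(raw(keep))), nCols, []).';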

Update

An alternate version of @Pursuit's answer would be something like:

numbers = cellfun(@(x)sscanf(x, '%f %f %f %f %f %f %f').', raw, 'uni', 0);
numbers = cat(1, numbers{:});

Upvotes: 1

Pursuit

Reputation: 12345

I would try running sscanf on each line and only keeping the lines that parse successfully.

Note that if:

raw{11} = '11  3.8065e-02  5.2115e-03  4.6128e-03  4.2786e-07  4.0619e-04  0.0000e+00'
raw{12} = 'iter  continuity  x-velocity  y-velocity           k     epsilon vf-vapour_ph'

Then

>> sscanf(raw{11},'%f')
ans =
                        11
                  0.038065
                 0.0052115
                 0.0046128
                4.2786e-07
                0.00040619
                         0

And:

>> sscanf(raw{12},'%f')
ans =
     []

To complete this thought, your code would look like this:

%% Read the file
file = 'dataFile.txt';
fid = fopen(file, 'r');
raw = textscan(fid, '%s', 'Delimiter', '\n');
fid = fclose(fid);
raw = raw{1,1};

%% Parse the file into the "residuals" variable

nextLine = 1; %This is the index of next line to insert

%Go through each line, one at a time
for ix = 1:length(raw)    
    %Parse the line with sscanf
    numbers = sscanf(raw{ix},'%f');

    if ~isempty(numbers)  %Skip any row that did not parse, otherwise ...
        %If you know the number of columns, you could replace "~isempty()" with "length()== "

        if nextLine == 1
            %If this is the first line of numbers, then initialize the
            %"residuals" variable.
            residuals= zeros(length(raw), length(numbers));
        end

        %Store the data, and increment "nextLine"
        residuals(nextLine,:) = numbers;
        nextLine = nextLine + 1;
    end
end

%Now, trim the excess allocation from "residuals"
residuals = residuals(1:(nextLine-1),:)

(Please let me know how it compares in speed.)
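If you want to time the different versions against each other, timeit is a convenient way to do it. A sketch, where parseOriginal and parseSscanf are hypothetical zero-argument wrappers around the original loop and this answer:

% timeit runs each handle several times and reports a typical execution time
tOriginal = timeit(@() parseOriginal('file.txt'));   % hypothetical wrapper around the original script
tSscanf   = timeit(@() parseSscanf('file.txt'));     % hypothetical wrapper around this answer
fprintf('original: %.1f s, sscanf: %.1f s\n', tOriginal, tSscanf);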

Upvotes: 0
