Reputation: 747
I'm trying to read a relatively large text file that contains columns of numbers interspersed with other text, though really I just want the columns of numbers. There's also a bunch of other text, not shown here, that doesn't appear at such regular intervals.
The file format:
*** LOTS OF OTHER TEXT AND NUMBERS ***
iter continuity x-velocity y-velocity k epsilon vf-vapour_ph time/iter
111 3.4714e-08 5.3037e-10 6.0478e-10 1.6219e-15 1.8439e-13 0.0000e+00 0:00:01 14
112 3.2652e-08 5.0553e-10 5.6497e-10 1.3961e-15 1.5730e-13 0.0000e+00 0:00:01 13
113 3.1371e-08 4.6175e-10 5.0506e-10 1.2020e-15 1.3419e-13 0.0000e+00 0:00:01 12
114 3.0016e-08 4.4331e-10 4.7391e-10 1.0388e-15 1.1447e-13 0.0000e+00 0:00:01 11
115 2.8702e-08 4.2111e-10 4.4778e-10 8.9904e-16 9.7680e-14 0.0000e+00 0:00:01 10
116 2.7476e-08 4.1484e-10 4.2711e-10 7.7955e-16 8.3342e-14 0.0000e+00 0:00:01 9
117 2.6436e-08 3.9556e-10 4.0601e-10 6.7890e-16 7.1113e-14 0.0000e+00 0:00:01 8
118 2.5374e-08 3.8633e-10 3.8826e-10 5.9234e-16 6.0674e-14 0.0000e+00 0:00:00 7
119 2.4292e-08 3.7473e-10 3.7584e-10 5.1814e-16 5.1786e-14 0.0000e+00 0:00:00 6
120 2.3474e-08 3.5952e-10 3.5622e-10 4.5405e-16 4.4207e-14 0.0000e+00 0:00:00 5
121 2.2612e-08 3.4485e-10 3.4159e-10 3.9910e-16 3.7707e-14 0.0000e+00 0:00:00 4
iter continuity x-velocity y-velocity k epsilon vf-vapour_ph time/iter
122 2.1992e-08 3.4100e-10 3.2964e-10 3.5272e-16 3.2204e-14 0.0000e+00 0:00:00 3
123 2.1592e-08 3.2444e-10 3.0170e-10 3.1487e-16 2.7500e-14 0.0000e+00 0:00:00 2
124 2.1053e-08 3.3145e-10 2.9325e-10 2.8009e-16 2.3485e-14 0.0000e+00 0:00:00 1
125 2.0390e-08 3.1502e-10 2.7534e-10 2.5433e-16 2.0053e-14 0.0000e+00 0:00:00 0
step flow-time mfr_arm_inne mfr_arm_oute pressure_sta pressure_sta pressure_tot pressure_tot velocity_max velocity_min
1 5.0000e-07 -5.5662e-08 1.4217e-07 6.0015e+00 5.9998e+00 6.0015e+00 5.9998e+00 2.8934e-04 3.3491e-10
Flow time = 5e-07s, time step = 1
799 more time steps
Updating solution at time levels N and N-1.
done.
Writing data to output file.
Current time=0.000000 Position=-0.00000036409265555078 Velocity=0.000015 Net force=0.210322
Fluid force=-0.477050N, Stator force=0.200000N ,Spring force=-32.990534N ,Top force=0.000000N, Bottom force=33.007906N, External force=0.470000N
Next time=0.000001 Position=-0.00000036400170391852 Velocity=0.000182
Applying motion to dynamic zone.
*** CONTINUING TEXT AND NUMBERS ***
The lines I want are:
111 3.4714e-08 5.3037e-10 6.0478e-10 1.6219e-15 1.8439e-13 0.0000e+00 0:00:01 14
112 3.2652e-08 5.0553e-10 5.6497e-10 1.3961e-15 1.5730e-13 0.0000e+00 0:00:01 13
The script I have so far works, but takes about 80s to do the whole thing.
Parsing is made more awkward, I presume, by the colons in the time column, which appear in some of my files. Some files will have more or fewer columns containing different types of data, and some will have an additional block at the end of the main chunk, such as:
step flow-time mfr_arm_inne mfr_arm_oute pressure_sta pressure_sta pressure_tot pressure_tot velocity_max velocity_min
1 5.0000e-07 -5.5662e-08 1.4217e-07 6.0015e+00 5.9998e+00 6.0015e+00 5.9998e+00 2.8934e-04 3.3491e-10
I'm not looking to get this data, but it can have a very similar (sometimes the same) format as the lines I want.
My script essentially reads each line and checks whether the first few characters (as many as there are digits in the iteration number) match the iteration number I'm expecting next (starting with 1, 2, 3... n). The reason I've done it this way is to remove the lines under "step...", which I don't want. However, the file is about 180,000 lines long (and it's my shortest), so you can imagine this gets a little slow.
% read the raw data from the file
file = 'file.txt';
fid = fopen(file, 'r');
raw = textscan(fid, '%s', 'Delimiter', '\n');
fid = fclose(fid);
raw = raw{1,1};
% expression used for splitting the columns up
colExpr = '[\d\.e:\-\+]+';
% beginning number
iterNum = 1;
% loop through lines
for line = 1:length(raw)
    % convert to string for comparison
    iterStr = num2str(iterNum);
    thisLine = raw{line, 1};
    % if the right length and the right string,
    if length(iterStr) <= length(thisLine) && ...
            strcmp(thisLine(1:length(iterStr)), iterStr)
        % split the string
        result(iterNum,:) = regexp(thisLine,colExpr, 'match');
        iterNum = iterNum + 1;
    end
end
% convert to matrix
residuals = cellfun(@str2num, result);
Using the profiler, I realise that the num2str() function is the slowest part (20s), followed by int2str() (10s), though I can't see a way of reading the data without it being part of the loop.
Wondering if there's something I'm missing to try and optimise this process?
EDIT:
I've included more of the lines that I don't want and a possible different format to try and help answers.
Upvotes: 2
Views: 76
Reputation: 124563
Here is a different approach: we first process the file externally, with something like:
# only keep lines starting with a digit
$ grep '^\s*[0-9]' file.txt > file2.txt
On Windows, you can use findstr as an equivalent to grep:
C:\> findstr /R /c:"^[ \t]*[0-9]" file.txt > file2.txt
Now in MATLAB, it's easy to load the resulting numeric data as a matrix:
>> load -ascii file2.txt
>> t = array2table(file2, 'VariableNames',...
{'iter','continuity','xvelocity','yvelocity','k','epsilon','vf_vapour_ph'})
t =
iter continuity xvelocity yvelocity k epsilon vf_vapour_ph
____ __________ __________ _________ __________ __________ ____________
1 0 6.2376e-07 0 0.0018988 2708.2 0
2 0 0.21656 0.23499 0.0097531 0.13395 0
3 0 0.11755 0.12824 0.0032109 0.1146 0
4 0 0.068112 0.072691 0.00089801 0.062219 0
5 0 0.043498 0.045244 0.00020248 0.025923 0
6 0.1938 0.029107 0.029029 4.8399e-05 0.0099171 0
7 0.13594 0.020037 0.019577 1.5502e-05 0.0043624 0
8 0.097518 0.013805 0.013249 5.1736e-06 0.0023341 0
9 0.070467 0.0098312 0.0091925 1.8272e-06 0.0012615 0
10 0.051538 0.0071181 0.0064673 7.2446e-07 0.0007012 0
11 0.038065 0.0052115 0.0046128 4.2786e-07 0.00040619 0
12 0.028369 0.0038465 0.0033381 2.8256e-07 0.00025864 0
13 0.021326 0.002857 0.0024454 1.9279e-07 0.00016126 0
Upvotes: 1
Reputation: 65450
Since you already have the entire file loaded into a cell array (raw), you can call regexp directly on it to remove the bad rows.
%// Find lines that contain your data
matches = regexp(raw, '^\s*\d(.*?\de[+\-]\d){6}');
%// Empty matches (header lines) should be removed
toremove = cellfun(@isempty, matches);
raw = raw(~toremove);
Then you can convert the result into a numeric array using str2num combined with strjoin.
data = reshape(str2num(strjoin(raw)), 7, []).';
The benefit of this approach is that it avoids an explicit loop with repeated per-line function calls, which is notorious for slowing MATLAB down.
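One caveat worth checking: the pattern above is not anchored at the end of the line, so a "step" row with more scientific-notation columns appears to match it as well. The following is a minimal sketch of a stricter filter, not a tested drop-in. It assumes the wanted rows have an integer iteration number followed by exactly six scientific-notation columns, optionally followed by the time and countdown columns; the names sampleLines, sci, timeCols, pattern, keep and wanted are made up for illustration. It cannot help when a "step" row has exactly the same layout as a wanted row.

```matlab
% Sample lines copied from the question (a stand-in for the "raw" cell array)
sampleLines = {
    'iter continuity x-velocity y-velocity k epsilon vf-vapour_ph time/iter'
    '111 3.4714e-08 5.3037e-10 6.0478e-10 1.6219e-15 1.8439e-13 0.0000e+00 0:00:01 14'
    '11 3.8065e-02 5.2115e-03 4.6128e-03 4.2786e-07 4.0619e-04 0.0000e+00'
    '1 5.0000e-07 -5.5662e-08 1.4217e-07 6.0015e+00 5.9998e+00 6.0015e+00 5.9998e+00 2.8934e-04 3.3491e-10'
    'Flow time = 5e-07s, time step = 1'
    '799 more time steps'
    };

% One whitespace-separated number such as 3.4714e-08 (assumed layout)
sci = '\s+[-+]?\d\.\d+e[+\-]\d+';
% Optional tail such as " 0:00:01 14" (only present in some files)
timeCols = '(\s+\d+:\d\d:\d\d\s+\d+)?';
% Anchored at both ends so rows with extra columns are rejected
pattern = ['^\s*\d+(' sci '){6}' timeCols '\s*$'];

% regexp returns an empty result for every line that does not match
keep = ~cellfun(@isempty, regexp(sampleLines, pattern, 'once'));
wanted = sampleLines(keep);
```

With the sample lines above, only the second and third entries are kept; the header, the "step" row and the free-text lines are dropped.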
Update
An alternate version of @Pursuit's answer would be something like:
numbers = cellfun(@(x)sscanf(x, '%f %f %f %f %f %f %f').', raw, 'uni', 0);
numbers = cat(1, numbers{:});
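For files that include the time/iter column, a format string with literal colons may let sscanf read the 0:00:01 field as three separate numbers. This is a minimal sketch under the assumption that each wanted row has the nine columns shown in the question's sample, giving 11 parsed values; the names sampleLines, fmt, nExpected, parsed and keep are made up for illustration. Rows that yield a different number of values (headers, "step" rows, free text) are discarded by the count check.

```matlab
% Sample lines copied from the question (a stand-in for the "raw" cell array)
sampleLines = {
    'iter continuity x-velocity y-velocity k epsilon vf-vapour_ph time/iter'
    '111 3.4714e-08 5.3037e-10 6.0478e-10 1.6219e-15 1.8439e-13 0.0000e+00 0:00:01 14'
    '112 3.2652e-08 5.0553e-10 5.6497e-10 1.3961e-15 1.5730e-13 0.0000e+00 0:00:01 13'
    '1 5.0000e-07 -5.5662e-08 1.4217e-07 6.0015e+00 5.9998e+00 6.0015e+00 5.9998e+00 2.8934e-04 3.3491e-10'
    'Flow time = 5e-07s, time step = 1'
    '799 more time steps'
    };

% Seven numeric columns, then h:mm:ss read as three numbers, then the countdown
fmt = '%f %f %f %f %f %f %f %f:%f:%f %f';
nExpected = 11;  % assumption: 7 + 3 + 1 values per wanted row

% Parse every line; sscanf stops at the first field that does not match
parsed = cellfun(@(s) sscanf(s, fmt).', sampleLines, 'UniformOutput', false);

% Keep only the rows that produced the full set of values
keep = cellfun(@numel, parsed) == nExpected;
residuals = cat(1, parsed{keep});
```

Here residuals ends up with one row per wanted line; columns 8 to 10 hold the hours, minutes and seconds, which can be dropped if only the residuals are needed.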
Upvotes: 1
Reputation: 12345
I would try running sscanf on each line, and only using the lines with a good hit.
Note that if:
raw{11} = '11 3.8065e-02 5.2115e-03 4.6128e-03 4.2786e-07 4.0619e-04 0.0000e+00'
raw{12} = 'iter continuity x-velocity y-velocity k epsilon vf-vapour_ph'
Then
>> sscanf(raw{11},'%f')
ans =
11
0.038065
0.0052115
0.0046128
4.2786e-07
0.00040619
0
And:
>> sscanf(raw{12},'%f')
ans =
[]
To complete this thought, your code would look like this:
%% Read the file
file = 'dataFile.txt';
fid = fopen(file, 'r');
raw = textscan(fid, '%s', 'Delimiter', '\n');
fid = fclose(fid);
raw = raw{1,1};   %Semicolon added so the whole file is not echoed to the console
%% Parse the file into the "residuals" variable
nextLine = 1; %This is the index of next line to insert
%Go through each line, one at a time
for ix = 1:length(raw)
    %Parse the line with sscanf
    numbers = sscanf(raw{ix},'%f');
    if ~isempty(numbers) %Skip any row that did not parse, otherwise ...
        %If you know the number of columns, you could replace "~isempty()" with "length()== "
        if nextLine == 1
            %If this is the first line of numbers, then initialize the
            %"residuals" variable.
            residuals = zeros(length(raw), length(numbers));
        end
        %Store the data, and increment "nextLine"
        residuals(nextLine,:) = numbers;
        nextLine = nextLine + 1;
    end
end
%Now, trim the excess allocation from "residuals"
residuals = residuals(1:(nextLine-1),:);
(Please let me know how it compares in speed.)
Upvotes: 0