jlb333333
jlb333333

Reputation: 371

Perl regex: More than one line to match and output in a given format: Completely revised

If one can understand how to store stuff in memory, know where it is and how to get it out again in an orderly way it will go a long way to achieving results in Perl.(And probably all programming languages)

I am not a programmer.

I am trying to extract data from 'older program' output and import it into a SQL database. The extraction is the thing.

My previous question was largely incorrect, as I found out when importing data into the table, as I did not have enough data from the 'old program' output file.

I would like to learn from my mistake and re-ask my previous question, hopefully correctly this time.

I have included my poor effort at extracting the data, exactly as it was last time. It doesn't come anywhere near getting the correct data out.

I believe this is quite a complex question but maybe it isn't. It is certainly above my level of Perl at present, and maybe ever.

Answers to my incorrectly phrased question have been partially understood. Thank you very much for them.

If I could summarize it, my main problem with this task is dealing with the type of question: 'If a line contains ... get data from two lines up and insert it at the beginning of .... Seemingly impossible for me. I tried regex over the end of line searches but was unable to get that to work. I was unable also to arrange successive loops to insert data in lines as I wanted. If one loop worked, the next did not and so on. I was prepared to work on successive files in a step by step process but the 'two lines up' question stumped me completely.

I was able to extract other data from these output files relatively easily as they are very orderly files, but this particular question has me stumped.

My revised question is:

My input file consists of batches of data(+-50 - 70 lines long) in the following format:

1(P1)      3 P.ell    05/0120 W/P068819 0 12.0 98/99  380 380           C03     104          PROCESSED 21/02/16 TIME 22.16.52 KSINA=8

AGE   SPH   %THN  %INC     SV SI   MAI20  HTPC  VIPC   AGE BA  DBH   HT  SPH   CIH% CIV%    CVD BCON    CMAI      C0   C0CAL    SI20
0  1100     .0  89.0%SPH  2  2     .00    .0    .0 20.00  1   .0 17.3    0     .0   .0     .0   .0    .0000    .000    0%      .00
 7   815   25.9    .0      2  2    9.90  75.5  47.2 20.00  1 26.6 17.3  330     .0   .0     .0 13.0    .2099   1.005   .000   17.30
13   550   32.5    .0                                                                                                              
18   330   40.0    .0                                                                                                              
45     0  100.0    .0                                                                                                              
0SQ     -4  -4  -4  = SI20  17 17 17              PLANTN---104 GREEN MEADOWS       MODEL---P.ELLIOTTII MAC MAC   SQ    10 SI20  22.90
HTPC   76  76  76  =MAI20  10 10 10 FROM HTPC                                                                   HTPC 100 MAI20 20.71
VIPC   47  47  47  =MAI20  10 10 10              HTPC/VIPC REGRESSION---P.ELLIOTTII GENERAL 1/83                VIPC 100 MAI20 20.99
                 MAIDBH     0                                                                                                    
                 INMAI==>   0                                                                                                    
0INPUT FOR CALCULATING HTPC & VIPC =  HT  ---- ----                                                                                  

AGE  DBH   HT  VTREE  SPH  BA  TOTAL WS  UTIL    S    A    B    C    D   TCAI CTCAI  TMAI   UCAI CUCAI  UMAI   SCAI CSCAI  SMAI IAGE

1   .0   .2  .0000  979   0      0  0     0    0    0    0    0    0     .0     0    .0     .0     0    .0     .0     0    .0  1.0
2   .0   .9  .0000  979   0      0  0     0    0    0    0    0    0     .0     0    .0     .0     0    .0     .0     0    .0  2.0
3  3.9  2.0  .0007  979   1      1  1     0    0    0    0    0    0     .7     1    .2     .0     0    .0     .0     0    .0  3.0
4  7.1  3.4  .0041  979   4      4  3     1    1    0    0    0    0    3.4     4   1.0     .6     1    .2     .0     0    .0  4.0
5  9.4  4.6  .0102  979   7     10  5     5    5    0    0    0    0    5.9    10   2.0    4.1     5    .9     .0     0    .0  5.0
6 11.3  5.7  .0188  979  10     18  6    12   12    1    0    0    0    8.4    18   3.1    7.5    12   2.0     .0     0    .0  6.0
7 13.0  6.7  .0293  979  13     29  7    22   19    3    0    0    0   10.3    29   4.1    9.7    22   3.1     .0     0    .0  7.0
                   17%                                                                                                           
THN 11.4  6.7  .0230  164   2      4  1     3    2    0    0    0    0                                                              
REM 13.4  6.7  .0315  815  12     26  6    20   17    3    0    0    0                                                              

 8 15.0  7.6  .0453  815  14     37  6    31   21   10    0    0    0   11.2    40   5.0   10.9    33   4.1     .0     0    .0  7.6
 9 16.4  8.5  .0607  815  17     49  6    43   23   20    0    0    0   12.5    52   5.8   12.2    45   5.0     .2     0    .0  8.6
10 17.4  9.4  .0771  815  19     63  7    56   24   30    2    0    0   13.4    66   6.6   13.1    58   5.8    1.3     2    .2  9.6
11 18.3 10.3  .0941  815  21     77  7    70   24   41    5    0    0   13.9    80   7.3   13.6    72   6.5    3.0     5    .4 10.6
12 19.0 11.3  .1118  815  23     91  7    84   24   50   10    0    0   14.4    94   7.8   14.1    86   7.2    5.4    10    .8 11.6
13 19.6 12.2  .1299  815  25    106  8    98   24   56   18    0    0   14.7   109   8.4   14.4   100   7.7    8.0    18   1.4 12.6
                   33%                                                                                                           
THN 17.5 12.2  .1044  265   6     28  2    25    8   15    3    0    0                                                              
REM 20.6 12.2  .1421  550  18     78  5    73   16   42   15    0    0                                                              

14 21.3 13.0  .1636  550  20     90  6    84   16   44   25    0    0   11.8   121   8.6   11.6   112   8.0   10.0    28   2.0 10.4
15 22.0 13.7  .1864  550  21    103  6    97   16   45   36    0    0   12.5   133   8.9   12.3   124   8.3   11.0    39   2.6 11.2
16 22.7 14.5  .2100  550  22    116  6   109   15   46   48    0    0   13.0   146   9.1   12.7   137   8.6   12.0    51   3.2 12.0
17 23.3 15.3  .2345  550  23    129  6   123   15   46   61    0    0   13.5   160   9.4   13.2   150   8.8   12.9    64   3.8 12.8
18 23.9 15.9  .2598  550  25    143  7   136   15   46   74    1    0   13.9   174   9.6   13.6   164   9.1   13.8    78   4.3 13.6
                   40%                                                                                                           
THN 21.6 15.9  .2142  220   8     47  2    45    6   19   20    0    0                                                              
REM 25.3 15.9  .2901  330  17     96  4    92    9   28   54    1    0                                                              

19 26.0 16.6  .3203  330  17    106  4   101    9   27   63    3    0   10.0   184   9.7    9.8   174   9.1   10.5    88   4.6 11.0
20 26.6 17.3  .3519  330  18    116  5   112    9   27   71    5    0   10.4   194   9.7   10.2   184   9.2   10.6    99   4.9 11.7
21 27.2 18.0  .3849  330  19    127  5   122    9   27   80    8    0   10.9   205   9.8   10.7   194   9.3   11.1   110   5.2 12.4
22 27.9 18.7  .4192  330  20    138  5   133    8   26   87   11    0   11.3   216   9.8   11.1   206   9.3   11.5   121   5.5 13.2
23 28.4 19.3  .4546  330  21    150  5   145    8   26   94   16    0   11.7   228   9.9   11.4   217   9.4   11.8   133   5.8 14.0
24 29.0 20.0  .4914  330  22    162  5   157    8   26  101   22    0   12.2   240  10.0   11.9   229   9.5   12.3   145   6.1 14.9
25 29.6 20.6  .5292  330  23    175  6   169    8   25  106   29    0   12.5   253  10.1   12.2   241   9.6   12.6   158   6.3 15.7
26 30.2 21.2  .5682  330  24    188  6   182    8   25  112   37    0   12.9   265  10.2   12.6   254   9.8   13.0   171   6.6 16.5
27 30.7 21.8  .6083  330  25    201  6   194    8   25  115   46    0   13.2   279  10.3   13.0   267   9.9   13.3   184   6.8 17.3
28 31.3 22.4  .6492  330  25    214  7   208    8   24  119   56    1   13.5   292  10.4   13.2   280  10.0   13.6   198   7.1 18.2
29 31.9 23.0  .6908  330  26    228  7   221    8   24  122   65    2   13.7   306  10.5   13.5   293  10.1   13.8   212   7.3 19.0
30 32.4 23.5  .7332  330  27    242  7   235    8   24  123   77    3   14.0   320  10.7   13.7   307  10.2   14.0   226   7.5 19.8
31 33.0 23.9  .7766  330  28    256  7   249    8   24  125   88    5   14.3   334  10.8   14.0   321  10.4   14.3   240   7.7 20.4
32 33.6 24.4  .8202  330  29    271  8   263    8   23  126   99    7   14.4   349  10.9   14.1   335  10.5   14.4   255   8.0 21.0

Firstly the two variables in the first line(1(P1...): in this case 'C03 104' need to be extracted from it and be sent to OUTPUT.(Same as previous question, but the output position changes.)

Secondly, all lines beginning with 'THN' need to be extracted as they are except that the THN can be dropped. If there are two, three, four or even five etc. 'THN' lines, they all need to be extracted from the batch and sent to OUTPUT.(+- same as previous question)

Thirdly, although sequentially the second step, the last figure in the 'AGE' column of the main tabular data just before the 'THN' line, needs to be attached to the extracted 'THN' line directly below it.(in this case the figures 7, 13 and 18) These need to be added to their respective THN lines. See expected output below where the ages have been inserted after the two 'C03 104' variables in each line.

If there are no 'THN' lines in a given batch, the entire batch should be ignored, with no output, and the next batch(starting with a '1(P1)' again) considered.

The correct output expected from the above batch is:

 CO3   104     7  11.4  6.7  .0230  164   2      4  1     3    2    0    0    0    0
 CO3   104    13  17.5 12.2  .1044  265   6     28  2    25    8   15    3    0    0
 CO3   104    18  21.6 15.9  .2142  220   8     47  2    45    6   19   20    0    0

As will be seen from this, the two variables from the top line are inserted at the start of the output THN data line. The age figure read from the input batch is then inserted into its respective THN line and thereafter the rest of the THN line data is attached.

My effort some time ago but not updated is as follows:

while ( my $line = <INPUT> ) {

                if($line =~ /\s{6,11}(\w{1}\d{1}\w{0,5})\s{0,5}(\d{3})/) {
                my @c_no = "$1,$2\n";   
                foreach (@c_no) {
                print OUTPUT $_;                    
                }
                if ($line =~ /^(\s{1}THN)(\s{1,3}\d{0,2}.\d)(\s{1,3}\d{0,2}.\d)(\s{1,2}\d{0,1}.\d{4})(\s{1,2}\d{2,4})
                            (\s{2,3}\d{1,2})(\s{1,6}\d{1,4})(\s{1,2}\d{1,2})(\s{1,5}\d{1,4})(\s{1,4}\d{1,4})
                            (\s{1,4}\d{1,4})(\s{1,4}\d{1,4})(\s{1,4}\d{1,4})(\s{1,4}\d{1,4})|(^1(P1))/x){

                print OUTPUT "$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14\n";
                }           
                }
                }

Advice, guidance and help would be greatly appreciated.

Upvotes: 1

Views: 75

Answers (2)

Borodin
Borodin

Reputation: 126722

This is much more simply done using split to separate each line into space-delimited fields, and quite straightforward if you maintain state variables for the two fields from the header row, and the age from any row whose first field is entirely digits. Then all that is necessary is to print these three values before the numbers on any line that starts with THN

Note that it's simplest to pass the name of the input file as a parameter on the command line. Then all you have to do is read from <>. All the opening and error handling are already done for you

The output format you've asked for is rather esoteric. I can't see any pattern to the column widths and I've had to write a custom printf format to recreate it. If you need something else then all the values in each output line are in the @data array, which you can use as you wish

use strict;
use warnings 'all';

my ($c1, $c2, $age);

while ( <> ) {

    next unless /\S/;

    my @fields = split;

    if ( $fields[0] eq '1(P1)' ) {
        ($c1, $c2) = @fields[10,11];
    }
    elsif ( $fields[0] !~ /\D/ ) {
        $age = $fields[0];
    }
    elsif ( $fields[0] eq 'THN' ) {
        my @data = ( $c1, $c2, $age, @fields[1..13] );
        printf "%4s %5s %5d %5.1f%5.1f%7.4f%5d%4d%7d%3d%6d%5d%5d%5d%5d%5d\n", @data;
    }
}

output

 C03   104     7  11.4  6.7 0.0230  164   2      4  1     3    2    0    0    0    0
 C03   104    13  17.5 12.2 0.1044  265   6     28  2    25    8   15    3    0    0
 C03   104    18  21.6 15.9 0.2142  220   8     47  2    45    6   19   20    0    0

Upvotes: 1

aghast
aghast

Reputation: 15310

I copied and modified your example data, so this hasn't had a really good test. And I'm printing to STDOUT for testing purposes, but that should be easy to change.

The trick is to recognize that you've got line matching to do, which is great with regexes, and other processing, which is probably better with plain old code. So build a little loop, and process the lines with equal precedence (this is important for detecting errors in the file - don't try to nest things too much). Put in some state variables to help keep track of what comes next, and be sure to reset them appropriately.

Also, one thing I noticed in your example code is that you spent a lot if time getting spacing and number-of-digits right for the fields. That was almost certainly wasted time in this context, since the key was the "THN" at the start of the line. One trick with processing text is to focus on the things you really need, and use .* for the other stuff. That way, line noise or a syntax error or some strange formatting glitch won't screw up your program. (Sometimes .* becomes [^"]* or whatever, but you take the point...)

my $line_prefix, $have_age_col, $age_col;
while (<>) {
    if (/^1\(P1\).*\s(?P<two_vars>\w+\s+\w+)\s+PROCESSED .* TIME .* KSINA=.*$/) {
        # Start new section
        $line_prefix = $+{two_vars};
        $have_age_col = 0;
        $age_col = undef;
    }
    if (/^AGE /) {
        $have_age_col = 1;
    }
    if ($have_age_col && /^\s{0,5}(\d+)/) {
        $age_col = substr "    ".$1, -5;
    }
    if (/^THN /) {
        die "THN encountered without header"
            unless $line_prefix;
        die "THN encountered without age column"
            unless $have_age_col and $age_col;

        s/^THN \s*//;
        s/\s+$//;
        my $output = "$line_prefix $age_col $_\n";
        print STDOUT $output;
    }
}

Upvotes: 0

Related Questions