user1155413

Reputation: 169

Need help iterating over file with specific format

I'm trying to parse a text file with a specific format and convert it to a CSV file. However, I'm having two problems with it:

  1. I need to skip the header separating each entry (4 lines; the first line begins with \n)
  2. It's only reading the last entry. I'm not sure what I'm doing wrong, or how to make it read all the entries in the text file.

My code:

my $grammar = qr!
        (?(DEFINE)
           (?<Identifier> [^=\n]+ )
           (?<Statement>
               (?: # Begin alternation
                   " #Opening quotes
                   [^"]+? # Any non-quotes (including a new line)
                   " # Closing quotes
                  | [^\n]+ # Or a single line
               )   # End alternation
            )

       )

    !x;

    my $file = do { local $/; <> }; #Slurp file named on command line
    my %columns;
    while( $file =~
       m{ ((?&Identifier))[\t ]*=[ \t]*((?&Statement)) $grammar}xgc )
    {
       my ($header,$value) = ($1,$2);

           # Remove leading spaces and quote variable if it contains commas:
       for($header,$value) { s/^\s+//mg; /,/ and s/^|$/"/g }

           # Substitute \n with \\n to make multi-line values single-line:
       for($value) { chomp; s/\n/\\n/g }

       $columns{$header}=$value
    }

    print join "," => sort keys %columns; # Print column headers
    print "\n";
    print join "," => map { $columns{$_} } sort keys %columns; # Column content
    print "\n";

The input file looks like this:

OPERATION_CONTEXT server:.oc_name alarm_object 1
On director: server:.temip.prd1149_director
AT Thu, Jan 16, 2014 10:33:44 PM All Attributes

                             Identifier = 1
                                  State = Outstanding
                         Problem Status = Not-Handled
                  Clearance Report Flag = False
                        Escalated Alarm = False
                     Creation Timestamp = Thu, Jan 16, 2014 10:21:17 PM
                         Managed Object = NETACT server:.NETACT51 BSC 716499 BCF 123
                        Target Entities = { NETACT server:.NETACT51 BSC 716499 BCF 123 }
                             Alarm Type = EnvironmentalAlarm
                             Event Time = Thu, Jan 16, 2014 10:17:14 PM
                         Probable Cause = Indeterminate
                      Specific Problems = { 7409 }
                Notification Identifier = 2433009629
                                 Domain = Domain server:.netact51_dom
                           Alarm Origin = IncomingAlarm
                     Perceived Severity = Critical
                        Additional Text = "ALARMA CRITICA SISTEMA DAS 1900
                                          #S#10497409      ***                                       ZONA TECNICA SANTI
                                          PLMN-PLMN/BSC-716499/BCF-123

                                          SC_logical_name:9344;"
                      Original Severity = Critical
                    Original Event Time = Thu, Jan 16, 2014 10:17:14 PM
                            Outage Flag = False
                    Problem Occurrences = 1 Problems
               GPP3 Problem Occurrences = 0 Problems
           Critical Problem Occurrences = 1 Problems
              Major Problem Occurrences = 0 Problems
              Minor Problem Occurrences = 0 Problems
            Warning Problem Occurrences = 0 Problems
      Indeterminate Problem Occurrences = 0 Problems
              Clear Problem Occurrences = 0 Problems
                               SA Total = 0 Alarms
                                 Comuna = "HUECHURABA"
                             CatCliente = "CAV"
                               Nemonico = "BSMT6_PZANF3"

OPERATION_CONTEXT server:.oc_name alarm_object 2
On director: server:.temip.prd1149_director
AT Thu, Jan 16, 2014 10:33:44 PM All Attributes

                             Identifier = 2
                                  State = Outstanding
                         Problem Status = Not-Handled
                  Clearance Report Flag = False
                        Escalated Alarm = False
                     Creation Timestamp = Thu, Jan 16, 2014 10:14:03 PM
                   Clearance Time Stamp = Thu, Jan 16, 2014 10:29:08 PM
                         Managed Object = NETACT server:.NETACT51 BSC 206259 BCF 103
                        Target Entities = { NETACT server:.NETACT51 BSC 206259 BCF 103 }
                             Alarm Type = EnvironmentalAlarm
                             Event Time = Thu, Jan 16, 2014 10:29:37 PM
                         Probable Cause = Indeterminate
                      Specific Problems = { 7409 }
                Notification Identifier = 3780327614
                                 Domain = Domain server:.netact51_dom
                           Alarm Origin = IncomingAlarm
                     Perceived Severity = Critical
                        Additional Text = "ALARMA CRITICA SISTEMA DAS 1900
                                          #S#10497409      ***                                       ZONA TECNICA CENTR
                                          Merval                           BSC VLP7
                                          PLMN-PLMN/BSC-206259/BCF-103
                                          ALARMA CRITICA SISTEMA DAS 1900

                                          SC_logical_name:94681;"
                      Original Severity = Critical
                    Original Event Time = Thu, Jan 16, 2014 10:10:01 PM
                            Outage Flag = False
                    Problem Occurrences = 4 Problems
               GPP3 Problem Occurrences = 0 Problems
           Critical Problem Occurrences = 4 Problems
              Major Problem Occurrences = 0 Problems
              Minor Problem Occurrences = 0 Problems
            Warning Problem Occurrences = 0 Problems
      Indeterminate Problem Occurrences = 0 Problems
              Clear Problem Occurrences = 3 Problems
                               SA Total = 6 Alarms
                                 Comuna = "VINA DEL MAR"
                             CatCliente = "CAV"
                               Nemonico = "BVLP7_MVALF9"

OPERATION_CONTEXT server:.oc_name alarm_object 3
On director: server:.temip.prd1149_director
AT Thu, Jan 16, 2014 10:33:45 PM All Attributes

                             Identifier = 3
                                  State = Outstanding
                         Problem Status = Not-Handled
                  Clearance Report Flag = False
                        Escalated Alarm = False
                     Creation Timestamp = Thu, Jan 16, 2014 09:41:59 PM
                         Managed Object = NETACT server:.NETACT51 BSC 938189 BCF 61
                        Target Entities = { NETACT server:.NETACT51 BSC 938189 BCF 61 }
                             Alarm Type = EnvironmentalAlarm
                             Event Time = Thu, Jan 16, 2014 09:37:58 PM
                         Probable Cause = Indeterminate
                      Specific Problems = { 7405 }
                Notification Identifier = 1757596347
                                 Domain = Domain server:.netact51_dom
                           Alarm Origin = IncomingAlarm
                     Perceived Severity = Major
                        Additional Text = "NUSS FAILURE, RECTIFIER_1 ALARM
                                          #S#10497405      **                                        ZONA TECNICA CENTR
                                          Pelluhue Playa
                                          PLMN-PLMN/BSC-938189/BCF-61

                                          SC_logical_name:9679;"
                      Original Severity = Major
                    Original Event Time = Thu, Jan 16, 2014 09:37:58 PM
                            Outage Flag = False
                    Problem Occurrences = 1 Problems
               GPP3 Problem Occurrences = 0 Problems
           Critical Problem Occurrences = 0 Problems
              Major Problem Occurrences = 1 Problems
              Minor Problem Occurrences = 0 Problems
            Warning Problem Occurrences = 0 Problems
      Indeterminate Problem Occurrences = 0 Problems
              Clear Problem Occurrences = 0 Problems
                               SA Total = 0 Alarms
                                 Comuna = "PELLUHUE"
                             CatCliente = "UNIC_SITE"
                               Nemonico = "BTAL2_PYUEF6"

Thanks a lot in advance for any help you can give me!

Upvotes: 2

Views: 106

Answers (3)

Kenosis

Reputation: 6204

The following doesn't address your script, but offers a line-by-line parsing approach:

use strict;
use warnings;

my ( $showHeader, $lastID, @header, @columns ) = ( 1, '' );

while (<>) {
    if ( my ( $identifier, $statement ) = /^\s+(\S[^=]+)\s+=\s+(.+)/ ) {

        if (    $identifier eq 'Managed Object'
            and $lastID ne 'Clearance Time Stamp' )
        {
            push @header, 'Clearance Time Stamp' if $showHeader;
            push @columns, '';
        }

        if ( $identifier eq 'Additional Text' ) {
            while (<>) {
                my ($additional) = /^\s+(\S.+)/ or next;
                $statement .= $additional;
                last if $additional =~ /SC_logical_name/;
            }
            $statement =~ s/\s+/ /g;
        }

        push @header, $identifier if $showHeader;
        push @columns, $statement;

        if ( $identifier eq 'Nemonico' ) {
            if ($showHeader) {
                print +( join ',', @header ), "\n";
                $showHeader = 0;
            }

            print +( join ',', map { $_ = qq/"$_"/ if /,/ and !/^"/; $_ } @columns ), "\n";
            undef @columns;
        }
        $lastID = $identifier;
    }
}

Usage: perl script.pl inFile [>outFile.csv]

The last, optional parameter directs output to a file.

In the Additional Text field, runs of whitespace are collapsed to a single space.

Hope this helps!

Upvotes: 2

tangent

Reputation: 561

Your style is certainly unusual, but your script does seem to work. The problem is that each entry you find clobbers the previous one, because every entry is stored under the same hash keys. You say that you 'need to skip the header separating each entry', but as far as I can see your script already does that, so maybe I'm misunderstanding. Anyway, these changes should fix your problem 2:

my %columns;
my $current_entry = ''; # add this
while( $file =~
    m{ ((?&Identifier))[\t ]*=[ \t]*((?&Statement)) $grammar}xgc ) {

    # ...removed

    for($value) { chomp; s/\n/\\n/g }

    # add this check to separate each entry
    if ($header eq 'Identifier ') {
        $current_entry = $value;
    }
    $columns{$current_entry}{$header}=$value;
}

# need to change the way you print the results
# assumes there is always an entry with Identifier = 1
# and that the first entry contains all possible headers

my $first = $columns{1};
my @headers = sort keys %$first;
print join "," => @headers; # Print column headers
print "\n";
for my $key (sort {$a <=> $b} keys %columns) {
    my $entry = $columns{$key};
    print join "," => map { $entry->{$_} } @headers; # Column content
    print "\n";
}

Upvotes: 0

Borodin

Reputation: 126722

Your Perl style is unusual and I find it very tricky to read, but you have treated the whole file as one long record. The header lines are ignored because they don't look like Identifier = Statement.

That means your hash elements are left set to the last value found for each identifier; in general, this is the contents of the final record.
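
To make the overwrite concrete, here is a minimal sketch. The (field, value) pairs are made up for illustration (the 'Cleared' state is not in your data), and it assumes that an Identifier field opens each record. The flat hash ends up holding only the second record, while the list of hashes keeps both:

```perl
use strict;
use warnings;

# Hypothetical (field, value) pairs, in the order two records would yield them.
my @pairs = (
    [ Identifier => 1 ], [ State => 'Outstanding' ],
    [ Identifier => 2 ], [ State => 'Cleared' ],
);

# One flat hash: every record writes to the same keys, so the last one wins.
my %flat;
$flat{ $_->[0] } = $_->[1] for @pairs;

# One hash per record: start a new hash whenever an Identifier field appears
# (assumption: Identifier is always the first field of a record).
my @records;
for my $pair (@pairs) {
    my ( $field, $value ) = @$pair;
    push @records, {} if $field eq 'Identifier' or !@records;
    $records[-1]{$field} = $value;
}

printf "flat hash holds Identifier %s only\n", $flat{Identifier};
printf "record list holds %d records\n", scalar @records;
```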

I believe you would be much better off relying less on regular expressions. As it stands, your code is (as you have found) very difficult to debug.
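
As a rough sketch of what that could look like, the file can be split into records first and each record parsed with one small pattern. The helper names split_records and parse_record are made up for this example, the inline sample is abridged from your input, and it assumes every record starts with an OPERATION_CONTEXT line and that quoted values never contain a double quote:

```perl
use strict;
use warnings;

# Hypothetical helper: split the slurped text into one chunk per record.
# The lookahead keeps the OPERATION_CONTEXT line at the start of each chunk.
sub split_records {
    my ($text) = @_;
    return grep { /\S/ } split /^(?=OPERATION_CONTEXT\b)/m, $text;
}

# Hypothetical helper: turn one record into a hash of field name => value.
# Header lines contain no '=' and so never match.
sub parse_record {
    my ($record) = @_;
    my %fields;
    while ( $record =~ /^[ \t]*([^=\n]+?)[ \t]*=[ \t]*("[^"]*"|[^\n]*)/mg ) {
        my ( $name, $value ) = ( $1, $2 );
        $value =~ s/\s*\n\s*/ /g;    # fold a multi-line quoted value onto one line
        $value =~ s/\s+$//;          # drop trailing whitespace
        $fields{$name} = $value;
    }
    return \%fields;
}

# Abridged sample so the sketch runs on its own; real input would be slurped.
my $sample = <<'END';
OPERATION_CONTEXT server:.oc_name alarm_object 1
On director: server:.temip.prd1149_director

                 Identifier = 1
            Additional Text = "ALARMA CRITICA SISTEMA DAS 1900

                              SC_logical_name:9344;"
                     Comuna = "HUECHURABA"

OPERATION_CONTEXT server:.oc_name alarm_object 2
On director: server:.temip.prd1149_director

                 Identifier = 2
                     Comuna = "VINA DEL MAR"
END

my @rows = map { parse_record($_) } split_records($sample);
printf "%s: %s\n", $_->{Identifier}, $_->{Comuna} for @rows;
```

Because each record gets its own hash, a field that is missing from some records (such as Clearance Time Stamp) simply has no key in that record, and the printing code can decide how to fill the gap.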

Upvotes: 0
