Reputation: 169
I'm trying to parse a txt file with a specific format and convert it to a CSV file. However, I'm having two problems with it:
My code:
my $grammar = qr!
(?(DEFINE)
(?<Identifier> [^=\n]+ )
(?<Statement>
(?: # Begin alternation
" #Opening quotes
[^"]+? # Any non-quotes (including a new line)
" # Closing quotes
| [^\n]+ # Or a single line
) # End alternation
)
)
!x;
my $file = do { local $/; <> }; #Slurp file named on command line
my %columns;
while( $file =~
m{ ((?&Identifier))[\t ]*=[ \t]*((?&Statement)) $grammar}xgc )
{
my ($header,$value) = ($1,$2);
# Remove leading spaces and quote variable if it contains commas:
for($header,$value) { s/^\s+//mg; /,/ and s/^|$/"/g }
# Substitute \n with \\n to make multi-line values single-line:
for($value) { chomp; s/\n/\\n/g }
$columns{$header}=$value
}
print join "," => sort keys %columns; # Print column headers
print "\n";
print join "," => map { $columns{$_} } sort keys %columns; # Column content
print "\n";
The input file looks like this:
OPERATION_CONTEXT server:.oc_name alarm_object 1
On director: server:.temip.prd1149_director
AT Thu, Jan 16, 2014 10:33:44 PM All Attributes
Identifier = 1
State = Outstanding
Problem Status = Not-Handled
Clearance Report Flag = False
Escalated Alarm = False
Creation Timestamp = Thu, Jan 16, 2014 10:21:17 PM
Managed Object = NETACT server:.NETACT51 BSC 716499 BCF 123
Target Entities = { NETACT server:.NETACT51 BSC 716499 BCF 123 }
Alarm Type = EnvironmentalAlarm
Event Time = Thu, Jan 16, 2014 10:17:14 PM
Probable Cause = Indeterminate
Specific Problems = { 7409 }
Notification Identifier = 2433009629
Domain = Domain server:.netact51_dom
Alarm Origin = IncomingAlarm
Perceived Severity = Critical
Additional Text = "ALARMA CRITICA SISTEMA DAS 1900
#S#10497409 *** ZONA TECNICA SANTI
PLMN-PLMN/BSC-716499/BCF-123
SC_logical_name:9344;"
Original Severity = Critical
Original Event Time = Thu, Jan 16, 2014 10:17:14 PM
Outage Flag = False
Problem Occurrences = 1 Problems
GPP3 Problem Occurrences = 0 Problems
Critical Problem Occurrences = 1 Problems
Major Problem Occurrences = 0 Problems
Minor Problem Occurrences = 0 Problems
Warning Problem Occurrences = 0 Problems
Indeterminate Problem Occurrences = 0 Problems
Clear Problem Occurrences = 0 Problems
SA Total = 0 Alarms
Comuna = "HUECHURABA"
CatCliente = "CAV"
Nemonico = "BSMT6_PZANF3"
OPERATION_CONTEXT server:.oc_name alarm_object 2
On director: server:.temip.prd1149_director
AT Thu, Jan 16, 2014 10:33:44 PM All Attributes
Identifier = 2
State = Outstanding
Problem Status = Not-Handled
Clearance Report Flag = False
Escalated Alarm = False
Creation Timestamp = Thu, Jan 16, 2014 10:14:03 PM
Clearance Time Stamp = Thu, Jan 16, 2014 10:29:08 PM
Managed Object = NETACT server:.NETACT51 BSC 206259 BCF 103
Target Entities = { NETACT server:.NETACT51 BSC 206259 BCF 103 }
Alarm Type = EnvironmentalAlarm
Event Time = Thu, Jan 16, 2014 10:29:37 PM
Probable Cause = Indeterminate
Specific Problems = { 7409 }
Notification Identifier = 3780327614
Domain = Domain server:.netact51_dom
Alarm Origin = IncomingAlarm
Perceived Severity = Critical
Additional Text = "ALARMA CRITICA SISTEMA DAS 1900
#S#10497409 *** ZONA TECNICA CENTR
Merval BSC VLP7
PLMN-PLMN/BSC-206259/BCF-103
ALARMA CRITICA SISTEMA DAS 1900
SC_logical_name:94681;"
Original Severity = Critical
Original Event Time = Thu, Jan 16, 2014 10:10:01 PM
Outage Flag = False
Problem Occurrences = 4 Problems
GPP3 Problem Occurrences = 0 Problems
Critical Problem Occurrences = 4 Problems
Major Problem Occurrences = 0 Problems
Minor Problem Occurrences = 0 Problems
Warning Problem Occurrences = 0 Problems
Indeterminate Problem Occurrences = 0 Problems
Clear Problem Occurrences = 3 Problems
SA Total = 6 Alarms
Comuna = "VINA DEL MAR"
CatCliente = "CAV"
Nemonico = "BVLP7_MVALF9"
OPERATION_CONTEXT server:.oc_name alarm_object 3
On director: server:.temip.prd1149_director
AT Thu, Jan 16, 2014 10:33:45 PM All Attributes
Identifier = 3
State = Outstanding
Problem Status = Not-Handled
Clearance Report Flag = False
Escalated Alarm = False
Creation Timestamp = Thu, Jan 16, 2014 09:41:59 PM
Managed Object = NETACT server:.NETACT51 BSC 938189 BCF 61
Target Entities = { NETACT server:.NETACT51 BSC 938189 BCF 61 }
Alarm Type = EnvironmentalAlarm
Event Time = Thu, Jan 16, 2014 09:37:58 PM
Probable Cause = Indeterminate
Specific Problems = { 7405 }
Notification Identifier = 1757596347
Domain = Domain server:.netact51_dom
Alarm Origin = IncomingAlarm
Perceived Severity = Major
Additional Text = "NUSS FAILURE, RECTIFIER_1 ALARM
#S#10497405 ** ZONA TECNICA CENTR
Pelluhue Playa
PLMN-PLMN/BSC-938189/BCF-61
SC_logical_name:9679;"
Original Severity = Major
Original Event Time = Thu, Jan 16, 2014 09:37:58 PM
Outage Flag = False
Problem Occurrences = 1 Problems
GPP3 Problem Occurrences = 0 Problems
Critical Problem Occurrences = 0 Problems
Major Problem Occurrences = 1 Problems
Minor Problem Occurrences = 0 Problems
Warning Problem Occurrences = 0 Problems
Indeterminate Problem Occurrences = 0 Problems
Clear Problem Occurrences = 0 Problems
SA Total = 0 Alarms
Comuna = "PELLUHUE"
CatCliente = "UNIC_SITE"
Nemonico = "BTAL2_PYUEF6"
Thanks a lot in advance for any help you can give me!
Upvotes: 2
Views: 106
Reputation: 6204
The following doesn't address your script, but offers a line-by-line parsing approach:
use strict;
use warnings;
my ( $showHeader, $lastID, @header, @columns ) = ( 1, '' );
while (<>) {
if ( my ( $identifier, $statement ) = /^\s+(\S[^=]+)\s+=\s+(.+)/ ) {
if ( $identifier eq 'Managed Object'
and $lastID ne 'Clearance Time Stamp' )
{
push @header, 'Clearance Time Stamp' if $showHeader;
push @columns, '';
}
if ( $identifier eq 'Additional Text' ) {
while (<>) {
my ($additional) = /^\s+(\S.+)/ or next;
$statement .= $additional;
last if $additional =~ /SC_logical_name/;
}
$statement =~ s/\s+/ /g;
}
push @header, $identifier if $showHeader;
push @columns, $statement;
if ( $identifier eq 'Nemonico' ) {
if ($showHeader) {
print +( join ',', @header ), "\n";
$showHeader = 0;
}
print +( join ',', map { $_ = qq/"$_"/ if /,/ and !/^"/; $_ } @columns ), "\n";
undef @columns;
}
$lastID = $identifier;
}
}
Usage: perl script.pl inFile [>outFile.csv]
The last, optional parameter directs output to a file.
Runs of whitespace in the Additional Text field are replaced by a single space.
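The quoting in the script above wraps a field in quotes only when it contains a comma and does not already start with a quote. If the values may also contain embedded double quotes, a small helper along these lines could be used instead. This is only a sketch: `csv_field` is a hypothetical name, and it assumes the common CSV convention of doubling embedded quotes.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one raw value into a CSV field.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;

    # Drop one pair of surrounding quotes if the source value was already quoted.
    $value =~ s/^"(.*)"$/$1/s;

    # Quote only when needed; embedded quotes are doubled.
    if ( $value =~ /[",\r\n]/ ) {
        $value =~ s/"/""/g;
        $value = qq{"$value"};
    }
    return $value;
}

# Example: build one CSV line from a list of raw values.
my @raw  = ( '1', 'Thu, Jan 16, 2014 10:21:17 PM', '"HUECHURABA"' );
my $line = join ',', map { csv_field($_) } @raw;
print "$line\n";
```

In the script above, the `map` block inside the final `print` could call such a helper in place of the inline `qq/"$_"/` test.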
Hope this helps!
Upvotes: 2
Reputation: 561
Your style is certainly unusual, but your script does seem to work. The problem is that you are clobbering each entry you find with the next one. You say that you 'need to skip the header separating each entry', but as far as I can see your script already does that, so maybe I'm misunderstanding that part. Anyway, these changes should fix your problem 2:
my %columns;
my $current_entry = ''; # add this
while( $file =~
m{ ((?&Identifier))[\t ]*=[ \t]*((?&Statement)) $grammar}xgc ) {
# ...removed
for($value) { chomp; s/\n/\\n/g }
# add this check to separate each entry
if ($header eq 'Identifier ') {
$current_entry = $value;
}
$columns{$current_entry}{$header}=$value;
}
# need to change the way you print the results
# assumes there is always an entry with Identifier = 1
# and that the first entry contains all possible headers
my $first = $columns{1};
my @headers = sort keys %$first;
print join "," => @headers; # Print column headers
print "\n";
for my $key (sort {$a <=> $b} keys %columns) {
my $entry = $columns{$key};
print join "," => map { $entry->{$_} } @headers; # Column content
print "\n";
}
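Note that in the sample input only the second entry has a Clearance Time Stamp line, so taking the headers from the first entry alone would silently drop that column. One way around the assumption is to collect the union of headers over all entries. The following is a minimal sketch; the contents of `%columns` here are made-up sample data in the same shape as the hash built above.

```perl
use strict;
use warnings;

# Hypothetical sample data: entry id => { header => value }.
my %columns = (
    1 => { 'Identifier' => 1, 'State' => 'Outstanding' },
    2 => {
        'Identifier'           => 2,
        'State'                => 'Outstanding',
        'Clearance Time Stamp' => '"Thu, Jan 16, 2014 10:29:08 PM"',
    },
);

# Collect every header seen in any entry, not just those of the first one.
my %seen;
$seen{$_} = 1 for map { keys %$_ } values %columns;
my @headers = sort keys %seen;

# Build the CSV lines; a field missing from an entry becomes an empty string.
my @lines = ( join ',', @headers );
for my $key ( sort { $a <=> $b } keys %columns ) {
    my $entry = $columns{$key};
    push @lines, join ',', map { $entry->{$_} // '' } @headers;
}
print "$_\n" for @lines;
```

The `//` (defined-or) operator needs Perl 5.10 or later, which the `(?&Identifier)` syntax in the original grammar requires as well.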
Upvotes: 0
Reputation: 126722
Your Perl style is unusual and I find it very tricky to read, but the underlying problem is that you have treated the whole file as one long record. The header lines are ignored because they don't look like Identifier = Statement.
That means your hash elements are left set to the last value found for each identifier. In general, this is the contents of the final record.
I believe you would be much better off relying less on regular expressions. The way you have it now it is (as you have found) very difficult to debug.
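As a rough illustration of what that could look like, here is a sketch that first splits the text into records at each OPERATION_CONTEXT line and then reads each record line by line. It rests on assumptions drawn only from the sample shown in the question: every record starts with an OPERATION_CONTEXT line, and a multi-line value starts with an opening quote and ends on a line that finishes with a closing quote. The name `parse_record` and the shortened sample text are made up for the example.

```perl
use strict;
use warnings;

# Parse one record into a hash of field name => value.
sub parse_record {
    my ($text) = @_;
    my ( %field, $open );    # $open: name of a quoted field still being read
    for my $line ( split /\n/, $text ) {
        if ( defined $open ) {
            # Inside a multi-line quoted value: append until the closing quote.
            $line =~ s/^\s+//;
            $field{$open} .= ' ' . $line;
            undef $open if $line =~ /"\s*$/;
        }
        elsif ( $line =~ /^\s*([^=]+?)\s*=\s*(.*?)\s*$/ ) {
            my ( $name, $value ) = ( $1, $2 );    # copy before the next match
            $field{$name} = $value;
            # An opening quote without a closing one starts a multi-line value.
            $open = $name if $value =~ /^"[^"]*$/;
        }
        # Lines without '=' (the three header lines) are skipped.
    }
    return \%field;
}

# Shortened, hypothetical sample in the same format as the question's input.
my $file = <<'END';
OPERATION_CONTEXT server:.oc_name alarm_object 1
On director: server:.temip.prd1149_director
Identifier = 1
Additional Text = "ALARMA CRITICA SISTEMA DAS 1900
PLMN-PLMN/BSC-716499/BCF-123
SC_logical_name:9344;"
Comuna = "HUECHURABA"
OPERATION_CONTEXT server:.oc_name alarm_object 2
Identifier = 2
Clearance Time Stamp = Thu, Jan 16, 2014 10:29:08 PM
Comuna = "VINA DEL MAR"
END

# Split before each OPERATION_CONTEXT line; the lookahead keeps the line itself.
my @records = map { parse_record($_) }
    grep { /\S/ } split /^(?=[ \t]*OPERATION_CONTEXT)/m, $file;

printf "Record %s has %d fields\n", $_->{Identifier}, scalar keys %$_
    for @records;
```

Because each record ends up in its own hash, a later record can no longer overwrite the values of an earlier one, and each step can be inspected separately while debugging.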
Upvotes: 0