Dude85

Reputation: 41

What does this Perl regex mean? Does it catch the right thing (XML to TXT)?

I've been tasked with updating a very old script. Perl is not my strong suit at all. The script outputs some statistics, but I've noticed that in the output lines a tag's value gets moved down to the wrong device. I suspect the regex in the script is somehow responsible, but I'm unable to read it.

The script crunches a directory of XML files, picks out specific tags and prints them into a TXT file, moves the XML files to a log dir, and then deletes the XML files that have been crunched.

  <ComputerStatus>
    <Name>PC1</Name>
    <VirusDefinitions>2019-06-23 rev. 001</VirusDefinitions>
    <LogonClient>Administrator</LogonClient>
    <IPAddress>192.168.2.2</IPAddress>
    <OperatingSystem>Windows Server 2008 R2 Standard Edition</OperatingSystem>
    <AutoProtectEnabled>1</AutoProtectEnabled>
    <AgentVersion>12.1.6168.6000</AgentVersion>
    <Status>1</Status>
    <LastUpdateTime>2019-06-25T09:53:19+12:00</LastUpdateTime>
    <Infected>0</Infected>
    <WorstInfectionIDX>9999</WorstInfectionIDX>
    <LastScanTime>2017-10-13T09:06:00+13:00</LastScanTime>
    <LastVirusTime>1970-01-01T00:00:00+13:00</LastVirusTime>
  </ComputerStatus>
  <ComputerStatus>
    <Name>PC2</Name>
    <VirusDefinitions>2019-06-23 rev. 001</VirusDefinitions>
    <LogonClient>Administrator</LogonClient>
    <IPAddress>192.168.2.3</IPAddress>
    <OperatingSystem>Windows Server 2012 R2 Standard Edition</OperatingSystem>
    <AutoProtectEnabled>1</AutoProtectEnabled>
    <AgentVersion>12.1.6168.6000</AgentVersion>
    <Status>1</Status>
    <LastUpdateTime>2019-06-25T09:54:59+12:00</LastUpdateTime>
    <Infected>0</Infected>
    <WorstInfectionIDX>9999</WorstInfectionIDX>
    <LastScanTime>2019-06-24T11:05:03+12:00</LastScanTime>
    <LastVirusTime>1970-01-01T00:00:00+13:00</LastVirusTime>
  </ComputerStatus>

This is the XML file from which I parse the stats. It's not all of it.

my @NAMES = ("Name", "VirusDefinitions", "IPAddress", "OperatingSystem", "AgentVersion", "Infected", "LastScanTime","LastUpdateTime","WorstInfectionIDX","LastVirusTime","Threats","StartDateTime","StopDateTime","TotalFiles","Duration","AutoProtectEnabled" );
my $DEBUG = 2; # debug on = 1 low, = 2 detailed, = 3 full, = 0 OFF. Will print to screen needed in file the pipe >filename
my $DETAILED = 0; #
#vars
my $path = $ARGV[0];
my $TXT = $ARGV[1];
open( FIL, "< $FILEA" )|| die "cant open file $!" ;
while (defined ($_ = <FIL>)) {
foreach my $N (@NAMES) {
#print "looking for $N\n" ;
    if  ($_ =~ /$N/) {
    if ($DEBUG gt 2){print "report: Looking for $N\n";}
    $_ =~ /$N(.*)$N/;
    my $TMP = $1;
    $TMP =~ s/[\$#@~!&*()<>\[\];,?^ `\\\/]+//g;
    #Switch that has to be extended if the array NAMES is extended
        if ($N eq "Name") {
        $NAME=$TMP; 
        }elsif ($N eq "VirusDefinitions"){ 
        $VIRUSDEF=$TMP;
        }elsif ($N eq "IPAddress") {
        $IP=$TMP;
        }elsif ($N eq "OperatingSystem") {
        $OS=$TMP;
        }elsif ($N eq "AgentVersion") {
        $AGNT=$TMP;
        }elsif ($N eq "Infected") {
        $INFEC=$TMP;
        }elsif ($N eq "LastScanTime") {
        $LAST=$TMP;
        }elsif ($N eq "LastUpdateTime"){
        $LASTUP=$TMP;
        }elsif ($N eq "WorstInfectionIDX") {
        $winfid=$TMP;
        }elsif ($N eq "LastVirusTime") {
        $lastvirust=$TMP;
        }elsif ($N eq "Threats"){
        $threats=$TMP;
        }elsif ($N eq "StartDateTime"){
        $starttime=$TMP;
        }elsif ($N eq "StopDateTime"){
        $stoptime=$TMP;
        }elsif ($N eq "TotalFiles"){
        $totalfil=$TMP;
        }elsif ($N eq "Duration"){
        $scandur=$TMP;
        }elsif ($N eq "AutoProtectEnabled") {
        $autoprot=$TMP;
        $CUST =~ s/\W//g; #We shave special characters off of the $CUST variable and return normal characters
        print LOG "$today $time, <LastScan><$LAST><LastUpdateTime><$LASTUP><ProjectNr><$PROJNO><Site><$CUST><Device><$NAME><ThreatsFound><$threats><Definition><$VIRUSDEF><IpAddress><$IP><OS><$OS><AgentVersion><$AGNT><Infected><$INFEC><WorstInfectionID><$winfid><LastVirusDetectionTime><$lastvirust><ScanStartTime><$starttime><ScanStopTime><$stoptime><ScanDuration><$scandur><AutoProtectionEnabled><$autoprot><FilesScanned><$totalfil><FileName><$FILE1>\n" ;
        #init variables again
        init
        }       
    }else{
    #print "no match $N\n"
    }

The above is the bit of code that parses the XML file and prints it to a text file. I'm aware that the code is error-prone.

$TMP =~ s/[\$#@~!&*()<>\[\];,?^ `\\\/]+//g;

I think the issue lies with the expression above.

190626 09:55:11, <LastScan><2019-06-19T22:36:04+02:00><LastUpdateTime><2019-06-20T20:58:17+02:00><ProjectNr><2><Site><redacted><Device><PC1><ThreatsFound><0><Definition><2019-06-23rev.001><IpAddress><192.168.2.2><OS><WindowsServer2008R2StandardEdition><AgentVersion><12.1.6168.6000><Infected><0><WorstInfectionID><9999><LastVirusDetectionTime><1970-01-01T00:00:00+01:00><ScanStartTime><2019-06-19T23:19:00+02:00><ScanStopTime><2019-06-19T23:25:35+02:00><ScanDuration><395><AutoProtectionEnabled><1><FilesScanned><130219><FileName><PerfMonSymantecEPM-20190625-AntiVirus.xml>
190626 09:55:11, <LastScan><2017-10-13T09:06:00+13:00><LastUpdateTime><2019-06-25T09:53:19+12:00><ProjectNr><2><Site><redacted><Device><PC2><ThreatsFound><0><Definition><2019-06-23rev.001><IpAddress><192.168.2.3><OS><WindowsServer2012R2StandardEdition><AgentVersion><12.1.6168.6000><Infected><0><WorstInfectionID><9999><LastVirusDetectionTime><1970-01-01T00:00:00+13:00><ScanStartTime><2019-06-19T23:19:00+02:00><ScanStopTime><2019-06-19T23:25:35+02:00><ScanDuration><395><AutoProtectionEnabled><1><FilesScanned><130219><FileName><PerfMonSymantecEPM-20190625-AntiVirus.xml>

The above is the text output file. As you can see, the "LastScan" value is printed on the wrong device: the PC2 line shows 2017-10-13T09:06:00+13:00, which is PC1's LastScanTime in the XML. I've gone blind staring at the code trying to figure out what the error is.

I'm by no means a Perl expert; I code in C# on a hobby scale. So I'm hoping you, the experts, are able to help me out. I've tried to make the info as readable as possible.

Upvotes: 0

Views: 39

Answers (1)

choroba

Reputation: 241968

It's a substitution, which generally looks like

s/PATTERN/REPLACEMENT/

The /g modifier means "global", i.e. it substitutes all occurrences of the pattern. The replacement is empty, so the substitution just removes all matches of the pattern.
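To see what the substitution does in practice, here is a minimal sketch that applies it to values of the kind the script captures. The helper name `strip_punct` is hypothetical and exists only for this illustration; the input strings are what `$1` would hold after `/$N(.*)$N/` matches a line of the sample XML from the question.

```perl
use strict;
use warnings;

# strip_punct is a hypothetical helper name; the substitution inside it
# is the one from the script, copied unchanged.
sub strip_punct {
    my ($text) = @_;
    $text =~ s/[\$#@~!&*()<>\[\];,?^ `\\\/]+//g;
    return $text;
}

# What $1 holds after /$N(.*)$N/ matches a line of the sample XML:
# everything between the tag name in the opening tag and the tag name
# in the closing tag, including the leftover ">" and "</".
my @captured = (
    '>2019-06-23 rev. 001</',
    '>Windows Server 2008 R2 Standard Edition</',
    '>2017-10-13T09:06:00+13:00</',
);

# Prints:
#   2019-06-23rev.001
#   WindowsServer2008R2StandardEdition
#   2017-10-13T09:06:00+13:00
print strip_punct($_), "\n" for @captured;
```

These results have the same shape as the values in the output file from the question (for example `2019-06-23rev.001`). The substitution only deletes characters from a value that has already been captured; it plays no part in deciding which device that value is printed with.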

s/[\$#@~!&*()<>\[\];,?^ `\\\/]+//g;
  ^                          ^^
  |                          |\
Beginning               End of \
of a character   the character  One or more
class                    class  times

The pattern matches any sequence of the characters $#@~!&*()<>[];,?^ `\/ (the space is part of the set). Some of the characters are escaped (preceded by a backslash) to prevent their interpretation as special characters.

A bare $ would have been interpreted as a sigil (starting a variable name). The [ doesn't need escaping, but it doesn't hurt. ] would have been interpreted as the end of the character class. \ would have been interpreted as an escape character, and / would have been interpreted as the substitution delimiter.

Using regexes to process XML is fragile. Perl has several good XML parsing libraries which should be used instead (e.g. XML::LibXML or XML::Twig).
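As a rough sketch of the parser-based approach, the following uses XML::LibXML to read each `<ComputerStatus>` element as one unit. In the sample XML, `<AutoProtectEnabled>` comes before `<LastScanTime>` within each record, while the script prints its output line as soon as it sees `AutoProtectEnabled`, so a field read after that point may end up on the next device's line. Collecting all fields per element avoids depending on tag order. The `<Report>` root element and the helper name `parse_status` are assumptions made for this example, since the question shows only a fragment of the file.

```perl
use strict;
use warnings;
use XML::LibXML;

# The <Report> root element is an assumption: the question shows only a
# fragment of the file, so the real root element name is unknown.
my $xml = <<'XML';
<Report>
  <ComputerStatus>
    <Name>PC1</Name>
    <IPAddress>192.168.2.2</IPAddress>
    <AutoProtectEnabled>1</AutoProtectEnabled>
    <LastScanTime>2017-10-13T09:06:00+13:00</LastScanTime>
  </ComputerStatus>
  <ComputerStatus>
    <Name>PC2</Name>
    <IPAddress>192.168.2.3</IPAddress>
    <AutoProtectEnabled>1</AutoProtectEnabled>
    <LastScanTime>2019-06-24T11:05:03+12:00</LastScanTime>
  </ComputerStatus>
</Report>
XML

# Hypothetical helper: returns one hash reference per <ComputerStatus>.
sub parse_status {
    my ($xml_text) = @_;
    my @fields = qw(Name IPAddress AutoProtectEnabled LastScanTime);
    my $doc    = XML::LibXML->load_xml(string => $xml_text);
    my @records;
    for my $node ($doc->findnodes('//ComputerStatus')) {
        my %rec;
        # findvalue returns an empty string when the child element is absent.
        $rec{$_} = $node->findvalue("./$_") for @fields;
        push @records, \%rec;
    }
    return @records;
}

# One output line per device, built only from that device's own element.
for my $r (parse_status($xml)) {
    print "<Device><$r->{Name}><LastScan><$r->{LastScanTime}>\n";
}
```

Because every value is looked up relative to its own `<ComputerStatus>` node, the order of the child tags no longer matters, and no punctuation has to be stripped from the values afterwards.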

Upvotes: 1
