Brandon

Reputation: 13

Identifying XML declaration in a Perl IF statement

I am trying to add a stylesheet declaration to the second line of any XML file my script processes. My script reads the file line by line into the $inputline string within a loop.

I have the following poorly-written Perl code:

while(<INPUT>) {

$inputline = $_;

if ($inputline =~ m/\<\?xml\ version\=\"1\.0\"\ encoding\=\"UTF-8\"\?\>/){

print OUTPUT "\<\?xml version\=\"1.0\" encoding\=\"UTF-8\"\?\>\n";
print OUTPUT "\<\?xml\-stylesheet type\=\"text\/xsl\" href\=\"askaway_transcript_stylesheet\.xsl\"\?\>\n";
}

#lots of other processing stuff
}

And I think this worked once, but it no longer does. Testing different output and tweaking things tells me that the IF statement is failing, and I've probably done something wrong there.

Any tips?

Upvotes: 1

Views: 256

Answers (1)

Jeff B

Reputation: 30099

Your regex for finding the XML header is very rigid. What if there are extra spaces? What if the encoding or the XML version is different? Regex is not the right tool for parsing XML/HTML (see this answer); however, it is understandable why you would want to use one here, given the limited scope of what you are trying to do.

That being said, if you want simplicity and can tolerate some possible failures, I would opt for a looser regex and insert the stylesheet line only once:

my $replaced = 0;   # declare this once, before the loop, so the flag persists
if ($inputline =~ m/\<\?xml\b.*\>/ && !$replaced) {

    print OUTPUT $inputline;
    print OUTPUT '<?xml-stylesheet type="text/xsl" href="askaway_transcript_stylesheet.xsl"?>'."\n";

    $replaced = 1;
}

Alternatively, you could exit your parse loop (with last) once the line has been inserted, assuming the insertion is all you are doing in the loop.
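
To see how brittle an exact-match pattern is, here is a small sketch that runs the pattern from the question (with the unnecessary backslashes trimmed, which does not change what it matches) and the looser one against a few declarations. The sample strings are made up for illustration, but each is a legal XML declaration:

```perl
use strict;
use warnings;

# Exact-match pattern from the question, minus the redundant escapes.
my $strict = qr/<\?xml version="1\.0" encoding="UTF-8"\?>/;
# Looser pattern suggested above.
my $loose  = qr/<\?xml\b.*>/;

# Hypothetical sample declarations (not taken from any real file).
my @samples = (
    '<?xml version="1.0" encoding="UTF-8"?>',    # exactly what the question expects
    '<?xml version="1.0"  encoding="UTF-8"?>',   # two spaces between attributes
    "<?xml version='1.0' encoding='UTF-8'?>",    # single quotes
    '<?xml version="1.0" encoding="utf-8" ?>',   # lowercase encoding, space before ?>
    '<?xml version="1.1"?>',                     # different version, no encoding
);

# Report which pattern accepts which declaration.
for my $line (@samples) {
    printf "%-45s strict=%-3s loose=%s\n", $line,
        ($line =~ $strict ? 'yes' : 'no'),
        ($line =~ $loose  ? 'yes' : 'no');
}
```

Only the first sample satisfies the exact pattern; the looser one accepts all five.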

Caveat:

  • If your XML is all written on one line, or even if there is another tag on the same line (which is legal), this will most likely break your XML.
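
If that caveat applies to your files, one option is to insert the stylesheet instruction right after the declaration's closing ?> instead of after the whole line. This is only a minimal sketch, assuming the declaration itself fits on one line; add_stylesheet and the <transcript> element are names I made up for the example:

```perl
use strict;
use warnings;

my $stylesheet = '<?xml-stylesheet type="text/xsl" href="askaway_transcript_stylesheet.xsl"?>';

# Hypothetical helper: returns a copy of the line with the stylesheet
# instruction placed directly after the XML declaration, if there is one.
sub add_stylesheet {
    my ($line) = @_;
    # \s after "xml" avoids matching "<?xml-stylesheet";
    # .*? stops at the first "?>" rather than the last ">" on the line.
    $line =~ s/(<\?xml\s.*?\?>)/$1\n$stylesheet/;
    return $line;
}

# Made-up one-line document: declaration and root element share a line.
print add_stylesheet(qq{<?xml version="1.0" encoding="UTF-8"?><transcript><msg>hi</msg></transcript>\n});
```

The root element ends up on the same line as the stylesheet instruction, which is still legal XML. A line without a declaration is returned unchanged.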

Edit:

Your entire while loop would probably look like this:

my $replaced = 0;   # set once, outside the loop, so it survives across lines
while($inputline = <MYXML>) {
    if ($inputline =~ m/\<\?xml\b.*\>/ && !$replaced) {

        print OUTPUT $inputline;
        print OUTPUT '<?xml-stylesheet type="text/xsl" href="askaway_transcript_stylesheet.xsl"?>'."\n";

        $replaced = 1;
    } else {
        print OUTPUT $inputline;
    }
}

Or:

my $replaced = 0;   # set once, outside the loop, so it survives across lines
while($inputline = <MYXML>) {

    print OUTPUT $inputline;

    if ($inputline =~ m/\<\?xml\b.*\>/ && !$replaced) {
        print OUTPUT '<?xml-stylesheet type="text/xsl" href="askaway_transcript_stylesheet.xsl"?>'."\n";

        $replaced = 1;
    }
}
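
For completeness, here is a self-contained sketch of that second variant, with the flag declared once before the loop so that it persists from one line to the next. It assumes lexical filehandles and uses in-memory strings so it runs without touching any files; copy_with_stylesheet and the sample XML are hypothetical, and a real script would open its actual input and output files instead:

```perl
use strict;
use warnings;

my $stylesheet = '<?xml-stylesheet type="text/xsl" href="askaway_transcript_stylesheet.xsl"?>';

# Hypothetical helper: copy every line from $in to $out, adding the
# stylesheet line once, after the first line that looks like an XML declaration.
sub copy_with_stylesheet {
    my ($in, $out) = @_;
    my $replaced = 0;                      # lives outside the loop
    while (my $line = <$in>) {
        print {$out} $line;
        if (!$replaced && $line =~ /<\?xml\b.*>/) {
            print {$out} "$stylesheet\n";
            $replaced = 1;
        }
    }
    return $replaced;                      # true if the line was added
}

# Demo with in-memory filehandles and made-up input.
my $xml    = qq{<?xml version="1.0" encoding="UTF-8"?>\n<transcript>\n</transcript>\n};
my $result = '';
open my $in,  '<', \$xml    or die "Cannot read input: $!";
open my $out, '>', \$result or die "Cannot write output: $!";
copy_with_stylesheet($in, $out);
close $out;
print $result;
```

Returning the flag lets the caller warn when no declaration was found.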

Upvotes: 1
