Reputation: 1147
I am writing a Perl script that reformats an XML file. I need help collapsing the whitespace that appears before the opening of an XML tag. I have the following XML file:
test.xml
<xml>
<TI>Definitions, Exemptions and Rebates "where"
<VARPARA><VAR>E</VAR></VARPARA></TI>
</xml>
I want a regex that replaces any whitespace (extra spaces and newline characters) before the opening of an XML tag with a single space. In the example above, <VARPARA> is the tag preceded by whitespace and a newline, right after "where".
I was thinking something along the lines of
$s =~ s/\s*</ </ig;
but this only looks at the opening <, whereas I want to match both the opening < and the closing >, that is, a complete tag such as <VARPARA>.
The output string should look like this:
<xml>
<TI>Definitions, Exemptions and Rebates "where" <VARPARA><VAR>E</VAR></VARPARA></TI>
</xml>
Upvotes: 0
Views: 90
Reputation: 63892
I'm not a regex expert, so this will probably fail in some scenarios, but going by your last comment, try the following:
echo '<xml>
<TI>Definitions, Exemptions and Rebates "where"
<VARPARA><VAR>E</VAR></VARPARA></TI>
<TI>Definitions, Exemptions and Rebates "where"
<VARPARA><VAR>E</VAR></VARPARA></TI>
</xml>' | perl -0777 -pE 's/(\S)(\s+)(<\w+?>)/$1 $3/g;s/> +</>\n</g'
<xml>
<TI>Definitions, Exemptions and Rebates "where" <VARPARA><VAR>E</VAR></VARPARA></TI>
<TI>Definitions, Exemptions and Rebates "where" <VARPARA><VAR>E</VAR></VARPARA></TI>
</xml>
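Since the question works on a string inside a script, here is a minimal sketch of the same two substitutions wrapped in a sub. The name collapse_before_tags is hypothetical, and the sample string is the file from the question. Like the one-liner, it only recognises tags without attributes (<\w+>), and the second pass will also rewrite any "> <" that was already in the input, so treat it as an illustration rather than a robust solution.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the same two substitutions as the one-liner above.
sub collapse_before_tags {
    my ($s) = @_;

    # Replace the whitespace between a non-space character and an
    # attribute-less opening tag such as <VARPARA> with a single space.
    $s =~ s/(\S)\s+(<\w+>)/$1 $2/g;

    # The first pass also joins adjacent elements ("> <"),
    # so put each of those back on its own line.
    $s =~ s/> +</>\n</g;

    return $s;
}

my $s = <<'XML';
<xml>
<TI>Definitions, Exemptions and Rebates "where"
<VARPARA><VAR>E</VAR></VARPARA></TI>
</xml>
XML

print collapse_before_tags($s);
```

Run against the question's input, this should print the three-line output the question asks for.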
Upvotes: 0
Reputation: 385496
To determine whether < is the start of a tag, you'd have to find out whether it's inside a comment, in a CDATA section, etc. You need more than a regex. I recommend using an existing parser.
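use strict;
use warnings;
use XML::LibXML qw( );

my $qfn = 'test.xml';   # assumed here: the file from the question, rewritten in place

my $parser = XML::LibXML->new();
my $doc = $parser->parse_file($qfn);

for my $text_node ($doc->findnodes('//text()')) {
    my $text = $text_node->data();
    next if $text =~ /^\s+\z/;       # skip whitespace-only nodes (indentation between elements)
    my $next_node = $text_node->nextSibling();
    next if !$next_node;             # no following sibling, so nothing to put a space before
    $text =~ s/\s+\z/ /;             # collapse trailing whitespace into a single space
    $text_node->setData($text);
}

$doc->toFile($qfn);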
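To illustrate the CDATA point made above, here is a small sketch with a made-up input in which "<b>" is literal text inside a CDATA section rather than a tag. The substitution is a tag-matching regex of the kind discussed in this thread. It cannot tell the difference, so it rewrites the character data.

```perl
use strict;
use warnings;

# Hypothetical input: the "<b>" inside the CDATA section is literal text, not a tag.
my $xml = "<doc><![CDATA[if a\n<b> then]]></doc>";

# A regex that treats "<word>" as an opening tag and collapses the whitespace before it.
(my $changed = $xml) =~ s/(\S)\s+(<\w+>)/$1 $2/g;

print $changed eq $xml
    ? "CDATA content left alone\n"
    : "CDATA content was modified\n";
# prints "CDATA content was modified"
```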
Upvotes: 2