atif

Reputation: 1147

regex match for a string

I am writing a Perl script that reformats an XML file. I need help ignoring whitespace before the opening of any XML tag. I have the following XML file:

test.xml

   <xml>
      <TI>Definitions, Exemptions and Rebates "where"  


    <VARPARA><VAR>E</VAR></VARPARA></TI>
   </xml>  

I want a regex that replaces any whitespace, including extra spaces and newline characters, before any opening XML tag with a single space. In the example above, <VARPARA> is the tag preceded by spaces and newlines after "where".

I was thinking something along the lines of

$s =~ s/\s*</ </ig; 

but this looks only at the opening <, whereas I want to match the complete opening tag, from < through the closing >, for example

    <VARPARA>

The output should look like this:

    <xml>
      <TI>Definitions, Exemptions and Rebates "where" <VARPARA><VAR>E</VAR></VARPARA></TI>
   </xml>  

Upvotes: 0

Views: 90

Answers (3)

atif

Reputation: 1147

This is how I handle it.

$s =~ s/\s+(?= \<\w+>)/ /xig;

Upvotes: 0

clt60

Reputation: 63892

I'm not a regex expert, so this will probably fail in some scenarios, but going by your last comment, try the following:

echo '<xml>
      <TI>Definitions, Exemptions and Rebates "where"  


    <VARPARA><VAR>E</VAR></VARPARA></TI>

<TI>Definitions, Exemptions and Rebates "where"  


    <VARPARA><VAR>E</VAR></VARPARA></TI>
</xml>' | perl -0777 -pE 's/(\S)(\s+)(<\w+?>)/$1 $3/g;s/> +</>\n</g'
<xml>
<TI>Definitions, Exemptions and Rebates "where" <VARPARA><VAR>E</VAR></VARPARA></TI>
<TI>Definitions, Exemptions and Rebates "where" <VARPARA><VAR>E</VAR></VARPARA></TI>
</xml>

Upvotes: 0

ikegami

Reputation: 385496

To determine whether a < is the start of a tag, you'd have to find out whether it's in a comment, in a CDATA section, etc. You need more than a regex. I recommend using an existing parser.

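use strict;
use warnings;
use XML::LibXML qw( );

my $qfn = 'test.xml';  # Assumed path (test.xml from the question); adjust as needed.

my $parser = XML::LibXML->new();
my $doc = $parser->parse_file($qfn);

for my $text_node ($doc->findnodes('//text()')) {
   my $text = $text_node->data();
   next if $text =~ /^\s+\z/;      # Leave whitespace-only nodes (indentation) alone.

   my $next_node = $text_node->nextSibling();
   next if !$next_node;            # Only touch text that is followed by another node.

   $text =~ s/\s+\z/ /;            # Collapse trailing whitespace into a single space.
   $text_node->setData($text);
}

$doc->toFile($qfn);                # Overwrites the input file in place.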

Upvotes: 2
