Reputation: 95
I have an XML document with text in attribute values. I can't change how the the XML file is generated, but need to extract the attribute values without loosing \r\n. The XML parser of course strips them out.
So I'm trying to replace \r\n in attribute values with entity references I'm using perl to do this because of it's non-greedy matching. But I need help getting the replace to happen only within the match. Or I need an easier way to do this :)
Here's is what I have so far:
perl -i -pe 'BEGIN{undef $/;} s/m_description="(.*?)"/m_description="$1"/smg' tmp.xml
This matches what I need to work with: (.*?). But I don't know to expand that pattern to match \r\n inside it, and do the replacement in the results. If I knew how many \r\n I have I could do it, but it seems I need a variable number of capture groups or something like that? There's a lot to regex I don't understand and it seems like there should be something do do this.
Example:
preceding lines
stuff m_description="Over
any number
of lines" other stuff
more lines
Should go to:
preceding lines
stuff m_description="Over any number of lines" other stuff
more lines
Solution
Thanks to Ikegam and ysth for the solution I used, which for 5.14+ is:
perl -i -0777 -pe's/m_description="\K(.*?)(?=")/ $1 =~ s!\n! !gr =~ s!\r! !gr /sge' tmp.xml
Upvotes: 0
Views: 738
Reputation: 53478
OK, so whilst this looks like an XML problem, it isn't. The XML problem is the person generating it. You should probably give them a prod with a rolled up copy of the spec as your first port of call for "fixing" this.
But failing that - I'd do a two pass approach, where I read the text, find all the 'blobs' that match a description, and then replace them all.
Something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my $text = do { local $/ ; <DATA> };
#filter text for 'description' text:
my @matches = $text =~ m{m_description=\"([^\"]+)\"}gms;
print Dumper \@matches;
#Generate a search-and-replace hash
my %replace = map { $_ => s/[\r\n]+/ /gr } @matches;
print Dumper \%replace;
#turn the keys of that hash into a search regex
my $search = join ( "|", keys %replace );
$search = qr/\"($search)\"/ms;
print "Using search regex: $search\n";
#search and replace text block
$text =~ s/m_description=$search/m_description="$replace{$1}"/mgs;
print "New text:\n";
print $text;
__DATA__
preceding lines
stuff m_description="Over
any number
of lines" other stuff
more lines
Upvotes: 0
Reputation: 98398
.
should already match \n
(because you specify the /s
flag) and \r
.
To do the replacement in the results, use /e
:
perl -i -0777 -pe's/(?<=m_description=")(.*?)(?=")/ my $replacement=$1; $replacement=~s!\n! !g; $replacement=~s!\r! !g; $replacement /sge' tmp.xml
I've also changed it to use lookbehind/lookahead to make the code simpler and to use -0777 to set $/
to slurp mode and to remove the useless /m
.
Upvotes: 2