drobert

Reputation: 95

Replace strings only within a regex match in perl

I have an XML document with text in attribute values. I can't change how the XML file is generated, but I need to extract the attribute values without losing \r\n. The XML parser of course strips them out.

So I'm trying to replace \r\n in attribute values with entity references. I'm using Perl to do this because of its non-greedy matching. But I need help getting the replacement to happen only within the match. Or I need an easier way to do this :)

Here is what I have so far:

perl -i -pe 'BEGIN{undef $/;} s/m_description="(.*?)"/m_description="$1"/smg' tmp.xml

This matches what I need to work with: (.*?). But I don't know how to expand that pattern to match \r\n inside it and do the replacement in the results. If I knew how many \r\n I have I could do it, but it seems I need a variable number of capture groups or something like that? There's a lot about regex I don't understand, and it seems like there should be something to do this.

Example:

preceding lines 
stuff m_description="Over
any number
of lines" other stuff
more lines

Should go to:

preceding lines 
stuff m_description="Over&#13;&#10;any number&#13;&#10;of lines" other stuff
more lines

Solution

Thanks to ikegami and ysth for the solution I used, which for Perl 5.14+ is:

perl -i -0777 -pe's/m_description="\K(.*?)(?=")/ $1 =~ s!\n!&#10;!gr =~ s!\r!&#13;!gr /sge' tmp.xml

Upvotes: 0

Views: 738

Answers (2)

Sobrique

Reputation: 53478

OK, so whilst this looks like an XML problem, it isn't. The XML problem is the person generating it. You should probably give them a prod with a rolled up copy of the spec as your first port of call for "fixing" this.

But failing that, I'd do a two-pass approach: read the text, find all the 'blobs' that match a description, and then replace them all.

Something like this:

#!/usr/bin/env perl

use strict;
use warnings;

use Data::Dumper;

my $text = do { local $/ ;  <DATA> }; 

#filter text for 'description' text: 
my @matches = $text =~ m{m_description=\"([^\"]+)\"}gms;

print Dumper \@matches; 

#Generate a search-and-replace hash
my %replace = map { $_ => s/[\r\n]+/&#13;&#10;/gr } @matches; 
print Dumper \%replace;

#turn the keys of that hash into a search regex
#(quotemeta makes regex metacharacters in the captured text match literally)
my $search = join ( "|", map { quotemeta($_) } keys %replace ); 
   $search = qr/\"($search)\"/ms; 

print "Using search regex: $search\n";
#search and replace text block
$text =~ s/m_description=$search/m_description="$replace{$1}"/mgs;

print "New text:\n";
print $text;

__DATA__
preceding lines 
stuff m_description="Over
any number
of lines" other stuff
more lines

Upvotes: 0

ysth

Reputation: 98398

. should already match \n (because you specify the /s flag) and \r.

To do the replacement in the results, use /e:

perl -i -0777 -pe's/(?<=m_description=")(.*?)(?=")/ my $replacement=$1; $replacement=~s!\n!&#10;!g; $replacement=~s!\r!&#13;!g; $replacement /sge' tmp.xml

I've also changed it to use lookbehind/lookahead to make the code simpler, to use -0777 to put $/ into slurp mode, and to drop the /m flag, which is useless here because the pattern has no ^ or $ anchors.

Upvotes: 2
