Reputation: 4230
I have a file that contains lines like this:
some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>
I need to replace all the spaces between <phrase> tags with underscores. In other words, I need to replace every space that falls between > and </ with an underscore. I've tried many different commands in sed, awk, and perl but haven't been able to get anything to work. Below are some of the commands I've tried:
sed 's@>\s+[</]@_@g'
perl -pe 'sub c{$s=shift;$s=~s/ /_/g;$s}s/>.*?[<\/]/c$&/ge'
sed 's@\(\[>^[<\/]]*\)\s+@\1_@g'
awk -v RS='\\[>^[<\]/]*\\]' '{ gsub(/\<(\s+)\>/, "_", RT); printf "%s%s", $0, RT }' infile
I've been looking at these 2 questions trying to modify the answers to use the characters I need.
sed substitute whitespace for dash only between specific character patterns
Can anyone please help?
Upvotes: 3
Views: 1374
Reputation:
If your data is in a file named 'd', with GNU sed:
sed -E ':b s/<(\w+)>([^<]*)\s([^<]*)(<\/\1)/<\1>\2_\3\4/;tb' d
Upvotes: 1
Reputation: 8711
Another Perl solution, replacing spaces only between the <phrase> tags:
$ export a="some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>"
$ echo $a | perl -lne ' s/(?<=<phrase>)(.+?)(?=<\/phrase>)/$x=$1;$x=~s{ }{_}g;sprintf("%s",$x)/ge ; print '
some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>
$
EDIT
Thanks @haukex, shortening further
$ echo $a | perl -lne ' s/(?<=<phrase>)(.+?)(?=<\/phrase>)/$x=$1;$x=~s{ }{_}g;$x/ge ; print '
some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>
$
Upvotes: 1
Reputation: 203491
With GNU awk for multi-char RS and RT:
$ awk -v RS='</?phrase>' '!(NR%2){gsub(/\s+/,"_")} {ORS=RT}1' file
some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>
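For readers without GNU awk, the same alternating-record idea can be sketched in Python. This is only an illustration, not part of the answer above: the function name underscore_phrases is made up, and it assumes the tags are balanced and never nested. Splitting on the tags with a capture group keeps the tags in the result, so the text inside a <phrase> element always lands at every fourth index starting from 2.

```python
import re

# The capture group makes split() keep the tags in the result list.
TAG = re.compile(r"(</?phrase>)")

def underscore_phrases(line: str) -> str:
    """Replace whitespace runs inside <phrase>...</phrase> with one underscore."""
    parts = TAG.split(line)
    # parts alternates: text, tag, text, tag, ...
    # With balanced, non-nested tags, text inside an element is at 2, 6, 10, ...
    for i in range(2, len(parts), 4):
        parts[i] = re.sub(r"\s+", "_", parts[i])
    return "".join(parts)

line = "some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>"
print(underscore_phrases(line))
```

Like the awk version, this collapses a run of several whitespace characters into a single underscore.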
Upvotes: 1
Reputation: 58400
This might work for you (GNU sed):
sed -E 's/<phrase>|<\/phrase>/\n&/g;ta;:a;s/^([^\n]*(\n[^\n ]*\n[^\n]*)*\n[^\n]*) /\1_/;ta;s/\n//g' file
Delimit tags by inserting newlines. Iteratively substitute spaces between pairs of newlines with underscores. When there are no more matches, remove the introduced newlines.
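The three steps described above can be followed more slowly in a Python sketch. It is a rough port for illustration only, not the answer's code: the names MARK, INSIDE_SPACE, and underscore_phrases are invented here, and it assumes each input line contains no newline of its own and that the tags are balanced.

```python
import re

MARK = "\n"  # marker character; assumes the line itself contains no newline

# Leading text, then any number of finished (inside, outside) segment pairs,
# then one marker followed by text up to a space. Because finished "inside"
# segments may not contain spaces, the matched space is always inside a tag pair.
INSIDE_SPACE = re.compile(r"^([^\n]*(?:\n[^\n ]*\n[^\n]*)*\n[^\n]*) ")

def underscore_phrases(line: str) -> str:
    # Step 1: put a marker in front of every opening and closing tag.
    marked = re.sub(r"</?phrase>", lambda m: MARK + m.group(0), line)
    # Step 2: replace one inside-space per pass until nothing matches
    # (this mirrors the ':a ... ta' loop in the sed command).
    while True:
        marked, count = INSIDE_SPACE.subn(r"\1_", marked, count=1)
        if count == 0:
            break
    # Step 3: remove the markers again.
    return marked.replace(MARK, "")

line = "some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>"
print(underscore_phrases(line))
```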
Upvotes: 1
Reputation: 3013
Don't use regular expressions to parse XML/HTML.
use warnings;
use 5.014; # for /r modifier
use Mojo::DOM;
my $text = <<'ENDTEXT';
some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>
ENDTEXT
my $dom = Mojo::DOM->new($text);
$dom->find('phrase')->each(sub { $_->content( $_->content=~tr/ /_/r ) });
print $dom;
Output:
some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>
Update: Mojolicious even contains some sugar that allows smashing that code into a one-liner:
$ perl -Mojo -pe '($_=x($_))->find("phrase")->each(sub{$_->content($_->content=~tr/ /_/r)})' input.txt
Upvotes: 5
Reputation: 85767
I need to replace every space that falls between > and </ with an underscore.
That won't actually do what you want because e.g. in
some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>
                  ^^^^^^^^^^^      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
the substrings "between > and </" cover more than you think (marked ^ above).
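To see this concretely, here is a small Python check (an illustration only, using Python's re module rather than the tools from the question). It lists what a non-greedy "from > to the next </" pattern captures on the sample line, and what a substitution based on that pattern would produce.

```python
import re

line = "some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>"

# Everything from a '>' up to the next '</', matched non-greedily.
spans = re.findall(r">(.*?)</", line)
print(spans)

# Replacing spaces in those spans also changes text outside the tags.
naive = re.sub(r">.*?</", lambda m: m.group(0).replace(" ", "_"), line)
print(naive)
```

The second captured span starts at the > that ends the first closing tag, so it takes in the untagged text and the next opening tag as well.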
I think the most straightforward way to express your requirements in Perl is
perl -pe 's{>[^<>]*</}{ $& =~ tr/ /_/r }eg'
Here [^<>] is used to make sure that the matched substring cannot contain < or > (in particular, it cannot match other <phrase> tags).
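For comparison, the same pattern can be sketched in Python. This is an assumed equivalent for illustration, not a replacement for the Perl command; the function name underscore_between_tags is made up here.

```python
import re

def underscore_between_tags(line: str) -> str:
    """Replace spaces in every '>text</' stretch that contains no '<' or '>'."""
    # [^<>]* cannot cross a tag boundary, so each match stays inside one element.
    return re.sub(r">[^<>]*</", lambda m: m.group(0).replace(" ", "_"), line)

line = "some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>"
print(underscore_between_tags(line))
```

Note that the pattern does not look at the tag name, so it would also rewrite the text of any other element on the line, exactly as the Perl version does.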
If that's too readable, you can also do
perl '-pes;>[^<>]*</;$&=~y> >_>r;eg'
Upvotes: 2