gary69

Reputation: 4230

sed replace whitespace with underscore between 2 strings

I have a file that contains lines like this:

some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>

I need to replace all the spaces between <phrase> tags with an underscore. So basically I need to replace every space that falls between > and </ with an underscore. I've tried many different commands in sed, awk, and perl but haven't been able to get anything to work. Below are some of the commands I've tried.

sed 's@>\s+[</]@_@g'

perl -pe 'sub c{$s=shift;$s=~s/ /_/g;$s}s/>.*?[<\/]/c$&/ge'

sed 's@\(\[>^[<\/]]*\)\s+@\1_@g'

awk -v RS='\\[>^[<\]/]*\\]' '{ gsub(/\<(\s+)\>/, "_", RT); printf "%s%s", $0, RT }' infile

I've been looking at these 2 questions trying to modify the answers to use the characters I need.
sed substitute whitespace for dash only between specific character patterns

https://unix.stackexchange.com/questions/63335/how-to-remove-all-white-spaces-just-between-brackets-using-unix-tools

Can anyone please help?

Upvotes: 3

Views: 1374

Answers (6)

user7712945

If your data is in a file named 'd', with GNU sed:

sed -E ':b s/<(\w+)>([^<]*)\s([^<]*)(<\/\1)/<\1>\2_\3\4/;tb' d

Upvotes: 1

stack0114106

Reputation: 8711

Another Perl solution, replacing spaces only between the <phrase> tags:

$ export a="some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>"

$ echo "$a" | perl -lne ' s/(?<=<phrase>)(.+?)(?=<\/phrase>)/$x=$1;$x=~s{ }{_}g;sprintf("%s",$x)/ge ;  print '
some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>

EDIT

Thanks @haukex; this can be shortened further:

$ echo "$a" | perl -lne ' s/(?<=<phrase>)(.+?)(?=<\/phrase>)/$x=$1;$x=~s{ }{_}g;$x/ge ;  print '
some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>

Upvotes: 1

Ed Morton

Reputation: 203491

With GNU awk for multi-char RS and RT:

$ awk -v RS='</?phrase>' '!(NR%2){gsub(/\s+/,"_")} {ORS=RT}1' file
some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>

Upvotes: 1

potong

Reputation: 58400

This might work for you (GNU sed):

sed -E 's/<phrase>|<\/phrase>/\n&/g;ta;:a;s/^([^\n]*(\n[^\n ]*\n[^\n]*)*\n[^\n]*) /\1_/;ta;s/\n//g' file

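Delimit the tags by inserting a newline before each one. Iteratively replace each space that sits between a pair of newlines (that is, inside a phrase) with an underscore. When there are no more matches, remove the introduced newlines. The initial ta;:a only resets the substitution flag set by the first s command, so that the ta inside the loop reacts to the underscore substitution alone.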
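
The three stages can be made visible with a small variation, shown here as a sketch rather than a replacement for the command above. It swaps the newline marker for a visible `@`, which assumes that `@` never occurs in the data, so each stage can be run and printed separately. The variable names (line, marked, filled, final) are only for illustration.

```shell
line='some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>'

# Stage 1: put a marker in front of every opening and closing tag.
# ASSUMPTION: "@" does not appear anywhere in the input.
marked=$(printf '%s\n' "$line" | sed -E 's/<phrase>|<\/phrase>/@&/g')

# Stage 2: loop, each pass turning one space that follows an odd number
# of markers (i.e. a space inside a phrase) into an underscore.
filled=$(printf '%s\n' "$marked" |
  sed -E -e ':a' -e 's/^([^@]*(@[^@ ]*@[^@]*)*@[^@]*) /\1_/' -e 'ta')

# Stage 3: drop the markers again.
final=$(printf '%s\n' "$filled" | sed 's/@//g')

printf '%s\n' "$marked" "$filled" "$final"
```

The first printed line shows where the markers land, the second shows the underscores filled in while the markers are still present, and the third is the finished line.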

Upvotes: 1

haukex

Reputation: 3013

Don't use regular expressions to parse XML/HTML.

use warnings;
use 5.014;  # for /r modifier
use Mojo::DOM;

my $text = <<'ENDTEXT';
some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>
ENDTEXT

my $dom = Mojo::DOM->new($text);
$dom->find('phrase')->each(sub { $_->content( $_->content=~tr/ /_/r ) });
print $dom;

Output:

some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>

Update: Mojolicious even contains some sugar that allows smashing that code into a one-liner:

$ perl -Mojo -pe '($_=x($_))->find("phrase")->each(sub{$_->content($_->content=~tr/ /_/r)})' input.txt

Upvotes: 5

melpomene

Reputation: 85767

"I need to replace every space that falls between > and </ with an underscore."

That won't actually do what you want because e.g. in

some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>
                  ^^^^^^^^^^^      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

the substrings "between > and </" cover more than you think (marked ^ above).

I think the most straightforward way to express your requirements in Perl is

perl -pe 's{>[^<>]*</}{ $& =~ tr/ /_/r }eg'

Here [^<>] is used to make sure that the matched substring cannot contain < or > (in particular, it cannot match other <phrase> tags).

If that's too readable, you can also do

perl '-pes;>[^<>]*</;$&=~y> >_>r;eg'

Upvotes: 2
