Reputation: 4250

sed replace whitespace with underscore between 2 strings

I have a file that contains lines like this

some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>

I need to replace all the spaces between <phrase> tags with an underscore. So basically I need to replace every space that falls between > and </ with an underscore. I've tried many different commands in sed, awk, and perl but haven't been able to get anything to work. Below are some of the commands I've tried.

sed 's@>\s+[</]@_@g'

perl -pe 'sub c{$s=shift;$s=~s/ /_/g;$s}s/>.*?[<\/]/c$&/ge'

sed 's@$\[>^[<\/]]*$\s+@\1_@g'

awk -v RS='\\[>^[<\]/]*\\]' '{ gsub(/\<(\s+)\>/, "_", RT); printf "%s%s", $0, RT }' infile

I've been looking at these 2 questions trying to modify the answers to use the characters I need.
sed substitute whitespace for dash only between specific character patterns

https://unix.stackexchange.com/questions/63335/how-to-remove-all-white-spaces-just-between-brackets-using-unix-tools

Can anyone please help?

Upvotes: 3

Answers (6)

user7712945

If your data is in a file named 'd', with GNU sed:

sed -E ':b s/<(\w+)>([^<]*)\s([^<]*)(<\/\1)/<\1>\2_\3\4/;tb' d

Upvotes: 1

stack0114106

Reputation: 8791

Another Perl solution, replacing spaces only between the <phrase> tags:

$ export a="some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>"

$ echo "$a" | perl -lne ' s/(?<=<phrase>)(.+?)(?=<\/phrase>)/$x=$1;$x=~s{ }{_}g;sprintf("%s",$x)/ge ;  print '
some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>

$

EDIT

Thanks @haukex, shortening further

$ echo "$a" | perl -lne ' s/(?<=<phrase>)(.+?)(?=<\/phrase>)/$x=$1;$x=~s{ }{_}g;$x/ge ;  print '
some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>

$

Upvotes: 1

Ed Morton

Reputation: 204638

With GNU awk for multi-char RS and RT:

$ awk -v RS='</?phrase>' '!(NR%2){gsub(/\s+/,"_")} {ORS=RT}1' file
some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>

Upvotes: 1

potong

Reputation: 58578

This might work for you (GNU sed):

sed -E 's/<phrase>|<\/phrase>/\n&/g;ta;:a;s/^([^\n]*(\n[^\n ]*\n[^\n]*)*\n[^\n]*) /\1_/;ta;s/\n//g' file

Delimit tags by inserting newlines. Iteratively substitute spaces between pairs of newlines with underscores. When there are no more matches, remove the introduced newlines.
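
To make the intermediate state visible, here is a small sketch (assuming GNU sed and a POSIX shell; the variable names line, marked and result are only illustrative) that runs the first substitution by itself and then the full command:

```shell
line='some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>'

# Step 1 on its own: insert a newline before every opening and closing tag,
# so text inside a tag pair always sits after an odd number of newlines.
marked=$(printf '%s\n' "$line" | sed -E 's/<phrase>|<\/phrase>/\n&/g')
printf '%s\n' "$marked"

# The full command from above: mark, loop over the spaces, remove the markers.
result=$(printf '%s\n' "$line" |
  sed -E 's/<phrase>|<\/phrase>/\n&/g;ta;:a;s/^([^\n]*(\n[^\n ]*\n[^\n]*)*\n[^\n]*) /\1_/;ta;s/\n//g')
printf '%s\n' "$result"
```

The first printf should show the line split into five pieces; the pieces that begin with <phrase> are the only ones the loop is allowed to touch.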

Upvotes: 1

haukex

Reputation: 3013

Don't use regular expressions to parse XML/HTML.

use warnings;
use 5.014;  # for /r modifier
use Mojo::DOM;

my $text = <<'ENDTEXT';
some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>
ENDTEXT

my $dom = Mojo::DOM->new($text);
$dom->find('phrase')->each(sub { $_->content( $_->content=~tr/ /_/r ) });
print $dom;

Output:

some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>

Update: Mojolicious even contains some sugar that allows smashing that code into a one-liner:

$ perl -Mojo -pe '($_=x($_))->find("phrase")->each(sub{$_->content($_->content=~tr/ /_/r)})' input.txt

Upvotes: 5

melpomene

Reputation: 85897

"I need to replace every space that falls between > and </ with an underscore."

That won't actually do what you want because e.g. in

some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>
                  ^^^^^^^^^^^      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

the substrings "between > and </" cover more than you think (marked ^ above).

I think the most straightforward way to express your requirements in Perl is

perl -pe 's{>[^<>]*</}{ $& =~ tr/ /_/r }eg'

Here [^<>] is used to make sure that the matched substring cannot contain < or > (in particular, it cannot match other <phrase> tags).

If that's too readable, you can also do

perl '-pes;>[^<>]*</;$&=~y> >_>r;eg'

Upvotes: 2
