Reputation: 1789

sed with simultaneous and sequential replace

I'm not sure whether what I want is possible in sed (or awk, or any other tool available from bash):

I want to write a script that replaces : ) in a string with <happy> and ) : with <sad>. This is easy to do with sed:

echo "test : )" | sed 's/: )/<happy>/g'
echo "test ) :" | sed 's/) :/<sad>/g'

Unfortunately, sometimes I have strings like these:

I'm happy : ) : ) : )
I'm sad ) : ) : ) :

In that case, the output should be:

I'm happy <happy> <happy> <happy>
I'm sad <sad> <sad> <sad>

But by combining the two commands above:

echo "I'm happy : ) : ) : )" | sed 's/: )/<happy>/g' | sed 's/) :/<sad>/g'
echo "I'm sad ) : ) : ) :" | sed 's/: )/<happy>/g' | sed 's/) :/<sad>/g'

I will get:

I'm happy <happy> <happy> <happy>
I'm sad ) <happy> <happy> :

One way to solve this would be to apply both replacements in parallel, scanning the string from left to right. I tried something like sed 's/a/b/g;s/c/d/g', but the substitutions are still applied one pattern after the other, so it doesn't solve the problem.

Upvotes: 9

Answers (4)

Toby Speight

Reputation: 30910

We can solve this problem in two passes:

Identify the replaceable strings, and mark them with delimiters (I'll use ! for both start and end, but you can use almost anything).
Now replace those delimited strings separately.

Here's a sed program that implements this approach:

#!/bin/sed -f

s/) :\|: )/!&!/g


s/!: )!/<happy>/g
s/!) :!/<sad>/g

A note on the delimiters:

We can use any delimiter we want for this, as we always re-match and replace the delimiters we introduce. This isn't the case in all sed scripts, and as a general rule it can be a good idea to use \n as delimiter (if you're processing single lines) or another unlikely character (perhaps \0 or \377 if you're processing ordinary text).

We can use any character in this script. For example, using a and b works just as well:

#!/bin/sed -f

s/) :\|: )/a&b/g

s/a: )b/<happy>/g
s/a) :b/<sad>/g

$ sed -f ../stackoverflow/51886023.sed <<<$'I\'m happy : ) : ) : )\nI\'m sad ) : ) : ) :'

I'm happy <happy> <happy> <happy>
I'm sad <sad> <sad> <sad>
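
The note on delimiters suggests \n when processing single lines, since a newline can never already be present in sed's pattern space at that point. A minimal sketch of that variant follows. It assumes GNU sed (where \n in the replacement inserts a newline) and uses -E instead of \|; the function name replace_emoji is only illustrative.

```shell
# Sketch assuming GNU sed: \n in the replacement text inserts a newline,
# which is not guaranteed by every sed implementation.
replace_emoji() {
    # Pass 1: wrap every emoticon, found left to right, in newlines.
    # Passes 2 and 3: replace each wrapped emoticon, markers included.
    sed -E -e 's/\) :|: \)/\n&\n/g' \
           -e 's/\n: \)\n/<happy>/g' \
           -e 's/\n\) :\n/<sad>/g'
}

printf '%s\n' "I'm happy : ) : ) : )" "I'm sad ) : ) : ) :" | replace_emoji
```

Because the markers are consumed again by the second and third substitutions, no newline is left behind in the output.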

Upvotes: 1

Jon

Reputation: 3671

If you have Perl available, it does a good job of this problem. Its e option on substitutions makes the code short and - for Perl - tidy.

my %map = (
    ": )" => "<happy>",
    ") :" => "<sad>",
);

while (<>) {
    s/\: \)|\) \:/$map{$&}/ge;
    print;
}

The general case, where the regular expression is built from the map, is handled by the script below. The subtlety is that Perl's regular expression engine tries the alternatives of a | alternation from left to right and takes the first one that matches, not the longest. The alternatives therefore have to be ordered so that longer keys come before their own prefixes; otherwise, in the example below, : )) would be matched as : ) followed by a stray ).

$ cat script.pl
#!/usr/bin/perl -w

use strict;

my %map = (
    ": )" => "<happy>",
    ") :" => "<sad>",
    ": |" => "<meh>",
    ": ))" => "<really happy>",
);

my @map_regexes = keys %map;
# Reverse string order puts every key ahead of its own prefixes (": ))" before ": )").
my @map_regexes_longest_first = reverse sort @map_regexes;
my @quoted_map_regexes = map(quotemeta, @map_regexes_longest_first);
my $map_regex = join("|", @quoted_map_regexes);

while (<>) {
    s/$map_regex/$map{$&}/ge;
    print;
}
$ cat file.txt
I'm happy : ) : ) : )
I'm sad ) : ) : ) :
I'm meh : | : | : |
I'm really happy : )) : )) : ))
$ perl -w script.pl <file.txt
I'm happy <happy> <happy> <happy>
I'm sad <sad> <sad> <sad>
I'm meh <meh> <meh> <meh>
I'm really happy <really happy> <really happy> <really happy>

Upvotes: 4

Sundeep

Reputation: 23677

For the given sample (i.e. dealing with two overlapping matches), this can also be solved in sed by looping:

$ cat ip.txt
I am happy : ) : ) : )
I am sad ) : ) : ) :
: ) : ) : )
) : ) : ) :
) : : ) :
: ) ) :

$ # GNU version: sed -E -e ':a s/(^|[^)].): \)/\1<happy>/g; ta' -e 's/\) :/<sad>/g'
$ sed -E -e ':a' -e 's/(^|[^)].): \)/\1<happy>/g' -e 'ta' -e 's/\) :/<sad>/g' ip.txt
I am happy <happy> <happy> <happy>
I am sad <sad> <sad> <sad>
<happy> <happy> <happy>
<sad> <sad> <sad>
<sad> <happy> :
<happy> <sad>

-e ':a' label a
s/(^|[^)].): \)/\1<happy>/g replace : ) with <happy> as long as 2nd character before it is not )
-e 'ta' branch to label a if there was successful substitution - looping is required because we have to check 4 characters for one replacement of 2 characters
s/\) :/<sad>/g once all the happy emojis are replaced, we can change all the sad emojis in one go
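
To make the need for the loop concrete, the sketch below runs the same substitutions with and without the :a / ta pair on one of the sample lines. The function names no_loop and with_loop are only illustrative.

```shell
# Same substitutions as above, first without and then with the loop.
no_loop() {
    printf '%s\n' "$1" |
        sed -E -e 's/(^|[^)].): \)/\1<happy>/g' -e 's/\) :/<sad>/g'
}
with_loop() {
    printf '%s\n' "$1" |
        sed -E -e ':a' -e 's/(^|[^)].): \)/\1<happy>/g' -e 'ta' -e 's/\) :/<sad>/g'
}

no_loop   ': ) : ) : )'   # prints "<happy> : <sad> )"
with_loop ': ) : ) : )'   # prints "<happy> <happy> <happy>"
```

Without the loop, only the first : ) is replaced in the single pass, because the second one is still preceded by the ) of the first. The leftover text then contains ) :, which the final substitution turns into <sad>.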

For multiple mappings, here's a Perl solution similar to the awk one:

$ perl -pe 'BEGIN{ $h{": )"}="<happy>"; $h{") :"}="<sad>";
                   $r = join "|", map quotemeta, keys %h; }
            s/$r/$h{$&}/g' ip.txt
I am happy <happy> <happy> <happy>
I am sad <sad> <sad> <sad>
<happy> <happy> <happy>
<sad> <sad> <sad>
<sad> <happy> :
<happy> <sad>

$h{": )"}="<happy>" create hash of key-value pairs
$r = join "|", map quotemeta, keys %h create regex alternation from all the keys of hash %h... map quotemeta will escape all characters other than [A-Za-z_0-9] for each hash key
s/$r/$h{$&}/g search and replace

Upvotes: 2

Ed Morton

Reputation: 204164

With GNU awk for the 3rd arg to match():

$ cat script1.awk
BEGIN {
    map[": )"] = "<happy>"
    map[") :"] = "<sad>"
}
{
    while ( match($0,/(.*)(: \)|\) :)(.*)/,a) ) {
        $0 = a[1] map[a[2]] a[3]
    }
    print
}

$ awk -f script1.awk file
I'm happy <happy> <happy> <happy>
I'm sad <sad> <sad> <sad>

With any awk:

$ cat script2.awk
BEGIN {
    map[": )"] = "<happy>"
    map[") :"] = "<sad>"
}
{
    while ( match($0,/: \)|\) :/) ) {
        $0 = substr($0,1,RSTART-1) map[substr($0,RSTART,RLENGTH)] substr($0,RSTART+RLENGTH)
    }
    print
}

$ awk -f script2.awk file
I'm happy <happy> <happy> <happy>
I'm sad <sad> <sad> <sad>

Although both approaches produce the same output in this case, the first approach actually works from the end of the string to the front courtesy of the leading .* while the second approach works front to back. You can see that with this test:

$ echo ': ) :' | awk -f script1.awk
: <sad>

$ echo ': ) :' | awk -f script2.awk
<happy> :

You can do a back-to-front pass with any awk with a tweak but I don't think that's what you really want anyway.

Edit to build the regexp from the map:

$ cat tst.awk
BEGIN {
    map[": )"] = "<happy>"
    map[") :"] = "<sad>"
    for (emoji in map) {
        gsub(/[^^]/,"[&]",emoji)
        gsub(/\^/,"\\^",emoji)
        emojis = (emojis == "" ? "" : emojis "|") emoji
    }
}
{
    while ( match($0,emojis) ) {
        $0 = substr($0,1,RSTART-1) map[substr($0,RSTART,RLENGTH)] substr($0,RSTART+RLENGTH)
    }
    print
}

$ awk -f tst.awk file
I'm happy <happy> <happy> <happy>
I'm sad <sad> <sad> <sad>
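
To see what the BEGIN loop in tst.awk builds, the sketch below applies the same two gsub() calls to a single key and prints the result. The helper name escape_key is hypothetical and exists only for this illustration.

```shell
# Print the regex fragment that the two gsub() calls produce for one key.
escape_key() {
    awk -v key="$1" 'BEGIN {
        gsub(/[^^]/, "[&]", key)   # wrap every character except ^ in a bracket expression
        gsub(/\^/, "\\^", key)     # a lone ^ is escaped with a backslash instead
        print key
    }'
}

escape_key ': )'   # prints "[:][ ][)]"
escape_key ') :'   # prints "[)][ ][:]"
```

Putting each character in its own bracket expression makes it literal, so keys containing regex metacharacters such as ) or | are safe to join with | into one pattern.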

Upvotes: 5
