Replace several occurrences of the same character in a different way in AWK

Question

I want to replace several characters in a csv file depending on the characters around them using AWK.

For example in this line:

"Example One; example one; EXAMPLE ONE; E. EXAMPLE One"

I would like to replace every capital "E" with "EE" if it is in a word that uses only capitals, and with "Ee" if it is in a word with both upper- and lower-case letters or in an abbreviation (like the "E."; it's an address file, so there are no cases where this could also be the end of a sentence). The result should look like this:

"Eexample One; example one; EEXAMPLEE ONEE; Ee. EEXAMPLEE One"

Now what I have tried is this:

{if ($0 ~/E[A-Z]+/)
    $0 = gensub(/E/,"EE","g",$0)
else if ($0 ~/[A-Z]E/)
    $0 = gensub(/E/,"EE","g",$0)
else
    $0 = gensub(/E/,"Ee","g",$0)
}

This works fine in most cases, but it fails for lines (or fields, for that matter) that contain several "E"s where I'd want one replaced with "Ee" and another with "EE". In "E. EXAMPLE One", for example, the condition matches the E in "EXAMPLE" and then every "E" in that line is replaced with "EE".
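The effect can be reproduced with a minimal sketch. It assumes the two identical "EE" branches are merged into one condition and swaps the gawk-only gensub for the portable gsub; the wrapper name line_level is hypothetical. The point it illustrates is that the test runs once against the whole of $0, so a single all-caps word decides the replacement for every "E" on the line.

```shell
# Hypothetical wrapper around the line-level logic from the attempt above.
line_level() {
    awk '{
        # The condition is evaluated once for the entire line ($0),
        # not separately for each word.
        if ($0 ~ /E[A-Z]+/ || $0 ~ /[A-Z]E/)
            gsub(/E/, "EE")     # applies to every "E" in $0
        else
            gsub(/E/, "Ee")
        print
    }'
}

printf '%s\n' 'E. EXAMPLE One' | line_level
# prints: EE. EEXAMPLEE One   (the abbreviation "E." was wanted as "Ee.")
```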

Is there a better way to do this? Can I maybe somehow use if within gensub?

ps: Hope this makes sense, I just started learning the basics of programming!

Ed Morton · Accepted Answer

$ cat tst.awk
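{
    head = ""                                   # text already processed
    tail = $0                                   # text still to be scanned
    # find the next word, including an optional trailing period
    while ( match(tail,/[[:alpha:]]+\.?/) ) {
        tgt = substr(tail,RSTART,RLENGTH)
        # all-caps word: append "E"; mixed case or abbreviation: append "e"
        add = (tgt ~ /^[[:upper:]]+$/ ? "E" : "e")
        gsub(/E/,"&"add,tgt)                    # "&" is the matched "E"
        head = head substr(tail,1,RSTART-1) tgt
        tail = substr(tail,RSTART+RLENGTH)
    }
    print head tail
}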

$ awk -f tst.awk file
Eexample One; example one; EEXAMPLEE ONEE; Ee. EEXAMPLEE One
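
To see how the loop walks a line, the following sketch prints each token that match() finds together with the classification that decides between "EE" and "Ee". It reuses the same two regexps as tst.awk; the wrapper name show_tokens and the labels "all-caps" and "mixed" are made up for illustration.

```shell
# Hypothetical helper: list each token and how it would be classified.
show_tokens() {
    awk '{
        tail = $0
        # match() sets RSTART/RLENGTH for the leftmost, longest match
        while (match(tail, /[[:alpha:]]+\.?/)) {
            tgt  = substr(tail, RSTART, RLENGTH)
            kind = (tgt ~ /^[[:upper:]]+$/ ? "all-caps" : "mixed")
            printf "%s\t%s\n", tgt, kind
            # continue scanning after the token just handled
            tail = substr(tail, RSTART + RLENGTH)
        }
    }'
}

printf '%s\n' 'E. EXAMPLE One' | show_tokens
# prints (tab-separated):
# E.        mixed
# EXAMPLE   all-caps
# One       mixed
```

Because the trailing period is part of the token, "E." fails the all-caps test and is handled like a mixed-case word.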

It's not clear, though, how you would distinguish a string of letters followed by a period that is an abbreviation from one that is just the end of a sentence.
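
Since the regexp includes an optional trailing period in each token, an all-caps word such as "EXAMPLE." would be classified as mixed case. One possible way to narrow this is sketched below, under the assumption (not stated in the question) that only a single letter followed by a period counts as an abbreviation. The wrapper name replace_e and the variable allcaps are hypothetical.

```shell
# Hypothetical variant of tst.awk: only "X." (one letter plus period)
# is treated as an abbreviation; this rule is an assumption.
replace_e() {
    awk '{
        head = ""
        tail = $0
        while (match(tail, /[[:alpha:]]+\.?/)) {
            tgt = substr(tail, RSTART, RLENGTH)
            # all caps: a bare upper-case word, or 2+ upper-case
            # letters followed by a period
            allcaps = (tgt ~ /^[[:upper:]]+$/ || tgt ~ /^[[:upper:]][[:upper:]]+\.$/)
            add = (allcaps ? "E" : "e")
            gsub(/E/, "&" add, tgt)
            head = head substr(tail, 1, RSTART - 1) tgt
            tail = substr(tail, RSTART + RLENGTH)
        }
        print head tail
    }'
}

printf '%s\n' 'E. EXAMPLE. One' | replace_e
# prints: Ee. EEXAMPLEE. One
```

On the sample line from the question, this variant gives the same output as tst.awk, because no all-caps word there is followed by a period.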
