i_am_so_stupid

Reputation: 305

Maintaining plurality and proper capitalization of replaced words using sed

I realize that title is horrible, but anyway: I have an assignment to change all instances of "cat" to "dog" using sed. Simple enough, but the text also includes words like "catapult" and "bearcat", which I tried to avoid by putting a space in the pattern. My problem is that every match becomes "dog", whereas in certain instances I want it to be "Dog" or "dogs"...

Here's the text file I'm changing:

Dear Homeowner,

Cats are important to people. We all enjoy the company of cats. If you have ever wanted to own a cat we can help. We are attempting to hold a “cat comes home” day for our city. To help us we've enlisted the NWMSU Bearcat cheerleaders, the organizers of the city's annual catapult toss, and local celebrities to help find homes for cats.

There is a cat that needs you to provide a home for them. So if you are a cat lover, please come and see if there isn't some way to find a home in your heart for a cat.

Thanks!!

Cats are people too.

And here is the output I'm getting, which is obviously wrong:

Dear Homeowner,

dog are important to people. We all enjoy the company of cats. If you have ever wanted to own a cat we can help. We are attempting to hold a “cat comes home” day for our city. To help us we've enlisted the NWMSU Bearcat cheerleaders, the organizers of the city's annual catapult toss, and local celebrities to help find homes for cats.

There is a cat that needs you to provide a home for them. So if you are a cat lover, please come and see if there isn't some way to find a home in your heart for a cat.

Thanks!!

dog are people too.

And this is my code:

sed 's/[Cc]at[s] /dog /g' cats-dogs.txt 

Upvotes: 3

Views: 208

Answers (3)

glenn jackman

Reputation: 246877

Using perl, but it ain't pretty:

perl -pe 's/\b(c)at(?=s?\b)/ $1 =~ m{[[:upper:]]} ? "Dog" : "dog" /ige' <<END
scat cat cats Cats Cat Catskills 
END

outputs

scat dog dogs Dogs Dog Catskills 

Upvotes: 0

declension

Reputation: 4185

I'm pretty sure you can't do this in (a single) RegEx alone.

That said, the simple solution might be the best here, since it seems there are only two possible cases (upper and lower) and one replacement word. sed also makes it easy to chain multiple substitutions.

So something like this should work (assuming GNU sed):

sed -r 's/\bCat(s?)\b/Dog\1/g; s/\bcat(s?)\b/dog\1/g' cats-dogs.txt

Using extended regexp as it's far less horrible to quote on the command line. Note the scanning for word boundaries here too.
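As a quick check, the two-substitution command can be run on a one-line sample. This is only a sketch under the same GNU sed assumption (for -r and \b); the sample words and the function name cat_to_dog are made up for illustration.

```shell
# Assumes GNU sed (for -r and \b); cat_to_dog is a hypothetical name.
cat_to_dog() {
  # First pass handles capitalized Cat/Cats, second pass lowercase cat/cats.
  sed -r 's/\bCat(s?)\b/Dog\1/g; s/\bcat(s?)\b/dog\1/g'
}

# "Bearcat" and "catapult" are untouched because \b requires a word
# boundary on both sides of cat / cats.
printf 'Cats cat cats Cat Bearcat catapult\n' | cat_to_dog
# prints: Dogs dog dogs Dog Bearcat catapult
```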

There's probably a very clever (and unreadable) sed way of doing this using \u and buffers too.

Upvotes: 3

ghoti

Reputation: 46856

Let's parse your attempt so far.

s/[Cc]at[s] /dog /g

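This searches for the regex [Cc]at[s] followed by a space, and substitutes "dog " for it. There are a few reasons it doesn't work...

  • It fails to maintain capitalization for the first letter.
  • The second bracket expression, [s], just means "the letter s". It is required rather than optional, so singular "cat" never matches, and the s is dropped from the replacement.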
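Both problems show up when the original command is run on a one-line sample. A minimal sketch; the sample sentence and the function name original_attempt are invented for illustration.

```shell
# Hypothetical wrapper around the command from the question.
original_attempt() {
  sed 's/[Cc]at[s] /dog /g'
}

# "Cats " matches, losing both the capital and the plural.
# "cats." and "cat " do not match, because the pattern demands
# a literal s followed by a space.
printf 'Cats like cats. A cat sat.\n' | original_attempt
# prints: dog like cats. A cat sat.
```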

If you're using Linux, then the version of sed installed on your system is probably GNU sed, with which the following might work:

sed -r 's/\bcat(s?)\b/dog\1/g;s/\bCat(s?)\b/Dog\1/g'

Note the -r option, which tells sed to use "Extended" regular expression notation rather than its default "Basic" notation.

This solution relies on sed's understanding of the \b word boundary, but it's important to note that this shorthand is NOT universally available in the sed implementations on other operating systems (FreeBSD, OSX, Solaris, etc). If portability is important, avoid using \b and similar things.

This shorthand is nice, but really isn't required. Here's the same thing in BRE:

sed 's/[[:<:]]cat\(s*\)[[:>:]]/dog\1/g;s/[[:<:]]Cat\(s*\)[[:>:]]/Dog\1/g'

This is BRE instead of ERE, so we don't use the -r option. I should point out that this will also match "catssss" because we're using s* instead of s?. BRE has no ? operator. POSIX does specify the interval \{0,1\} for "zero or one", but older sed implementations may not support it.
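Where the local sed supports POSIX interval expressions, \{0,1\} is one way to say "zero or one" in BRE. The sketch below compares it with s* and anchors with ^ and $ rather than word-boundary classes, purely so that it does not depend on them; the sample inputs are made up.

```shell
# s* accepts any number of trailing s characters.
printf 'catssss\n' | sed 's/^cat\(s*\)$/dog\1/'
# prints: dogssss

# \{0,1\} accepts at most one s, so this line is left alone...
printf 'catssss\n' | sed 's/^cat\(s\{0,1\}\)$/dog\1/'
# prints: catssss

# ...while the plain plural still matches.
printf 'cats\n' | sed 's/^cat\(s\{0,1\}\)$/dog\1/'
# prints: dogs
```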

The traditional classes [[:<:]] and [[:>:]] match specifically at the beginning and the end of a word, respectively. That may sometimes be preferred over GNU sed's \b "word boundary", which matches at either the beginning or the end of a word.

The non-GNU RE format can be seen on any unix with man re_format.

(NOTE: sed's -r option is also not universal. In OSX, use -E instead. This is because OSX's sed is derived from an older version of FreeBSD, which only added -r as an equivalent option to -E a few versions ago.)
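One way a script could cope with the -E / -r difference is to probe for the flag first. This is a rough sketch, assuming at least one of the two flags is accepted by the local sed; the variable name ere_flag is made up.

```shell
# Probe which extended-regex flag the local sed accepts.
# Assumption: at least one of -E or -r is supported.
if printf 'x\n' | sed -E 's/(x)/\1/' >/dev/null 2>&1; then
  ere_flag=-E
else
  ere_flag=-r
fi

# No \b in this pattern, so only the flag differs between platforms.
printf 'cats\n' | sed "$ere_flag" 's/cat(s?)/dog\1/'
# prints: dogs
```

The probe only settles the flag; word-boundary syntax still has to be chosen per platform as described above.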

Upvotes: 3
