Separating specific words with underscores, but not the plural form

I've been working with regex on strings recently and I've hit a snag. You see, I'm trying to get this:

chocolatecakes
thecakeismine
cakessurpassexpectation

to become this:

chocolate_cakes
the_cake_ismine
cakes_surpassexpectation

However, when I use this:

#!/bin/bash

words_array=(is cake)
number_of_times=0

word_underscorer (){
    echo $1 | sed -r "s/([a-z])($2)/\1_\2/g" | sed -r "s/($2)([a-z])/\1_\2/g"
}

for words_to_underscore in "${words_array[@]}"; do

    if [ "$number_of_times" -eq 0 ]; then
        first=`word_underscorer "chocolatecakes" "$words_to_underscore"`
        second=`word_underscorer "thecakeismine" "$words_to_underscore"`
        third=`word_underscorer "cakessurpassexpectation" "$words_to_underscore"`
    else
        word_underscorer "$first" "$words_to_underscore"
        word_underscorer "$second" "$words_to_underscore"
        word_underscorer "$third" "$words_to_underscore"
    fi

    echo "$first"
    echo "$second"
    echo "$third"
done

I get this:

chocolate_cake_s
the_cake_ismine
cake_ssurpassexpectation

I'm not sure how to fix this.

Upvotes: 0

Answers (3)

potong

Reputation: 58578

This might work for you (GNU sed):

sed -r 's/\B([^_])\B(cakes?|is)\B/\1_\2/g;s/(cakes?|is)\B([^_])\B/\1_\2/g' file

Insert an underscore in front of or behind a particular word if that word sits inside another word and the character before or after it is not already an underscore.
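
As a quick check, the command above can be run against the three sample strings from the question. This is only a minimal sketch: it assumes GNU sed (both `-r` and `\B` are GNU extensions), the wrapper name `underscore_words` is hypothetical, and the input is piped in rather than read from `file`.

```shell
#!/bin/sh
# Assumes GNU sed: -r and \B are GNU extensions.
# underscore_words is a hypothetical wrapper name, not part of the answer.
underscore_words() {
    printf '%s\n' "$1" |
        sed -r 's/\B([^_])\B(cakes?|is)\B/\1_\2/g;s/(cakes?|is)\B([^_])\B/\1_\2/g'
}

# Run the wrapper over the sample strings from the question.
for s in chocolatecakes thecakeismine cakessurpassexpectation; do
    underscore_words "$s"
done
```

In the second expression, the final `\B` requires the character after the word to be followed by yet another word character, so a lone trailing `s` at the end of a string is not split off.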

Upvotes: 0

perreal

Reputation: 98118

If you write the words to a file (words) then you can do something like this:

sed -e 's/\('$(sed ':l;N;s/\n/\\|/;bl' words )'\)/\1_'/g -e 's/_$//' input

This gives you:

chocolate_cakes
the_cake_ismine
cakes_surpassexpectation

The main point is to construct this sed command:

sed -e s/\(chocolate\|cake\|the\|cakes\)/\1_/g -e s/_$// input
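
The same idea, building the alternation from a word list, can be sketched in a slightly more portable way. This is an illustration under assumptions, not the command from this answer: it uses `paste` and `sed -E` instead of the nested sed call, the temporary word file and the function name `underscore_after_words` are hypothetical, and the plural `cakes` is listed before `cake` so the result does not depend on how the regex engine resolves the alternation.

```shell
#!/bin/sh
# Hypothetical word list, one word per line; the plural is listed before the singular.
words_file=$(mktemp)
printf '%s\n' chocolate cakes cake the > "$words_file"

# Join the lines with "|" to form an extended-regex alternation:
# chocolate|cakes|cake|the
alternation=$(paste -sd'|' "$words_file")
rm -f "$words_file"

# Append "_" after every listed word, then drop an underscore left at the end.
underscore_after_words() {
    printf '%s\n' "$1" | sed -E -e "s/($alternation)/\1_/g" -e 's/_$//'
}

for s in chocolatecakes thecakeismine cakessurpassexpectation; do
    underscore_after_words "$s"
done
```

Note that this approach only works when every word that should be followed by an underscore is in the list, which is why `chocolate` and `the` appear alongside `cake`.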

Upvotes: 1

l'L'l

Reputation: 47292

Based on what you've shown you could do something such as:

sed -r -e "s/($2)/_\1_/g" -e "s/($2)_s|^($2)(_*)/\1s\2_/g" -e "s/^_|_$//g"

That should return the final result of:

chocolate_cakes
the_cake_ismine
cakes_surpassexpectation

The idea here is a process of elimination. That is not to say this method has no potential issues, as you'll hopefully see below. Each sed operation is numbered to help you follow what is happening.

The sed commands work on the array, which starts out with "is" and then "cake":

1. is  ->  _is_
2. is_s or is_  ->  iss or is_
3. _is_  ->  is

1. cake  ->  _cake_
2. cake_s or cake_  ->  cakes or cake_
3. _cake_  ->  cake

string one:

1. chocolatecakes -> chocolate_cake_s
2. chocolate_cake_s -> chocolate_cakes_
3. chocolate_cakes_ -> chocolate_cakes

string two:

1. thecake_is_mine -> the_cake_ismine
2. the_cake_ismine -> no change
3. the_cake_ismine -> no change

string three:

1. cakessurpassexpectation -> _cake_ssurpassexpectation
2. _cake_ssurpassexpectation -> _cakes_surpassexpectation
3. _cakes_surpassexpectation -> cakes_surpassexpectation
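
To follow the numbered steps on your own strings, the three expressions can be run one at a time. The sketch below is an assumption-laden illustration: `trace_passes` is a hypothetical helper, the word is passed as an argument instead of `$2` from the question's function, and `sed -E` is used in place of `-r`.

```shell
#!/bin/sh
# trace_passes is a hypothetical helper: $1 = input string, $2 = word.
# It applies the three expressions separately and prints each intermediate result.
trace_passes() {
    word=$2
    # 1. Wrap every occurrence of the word in underscores.
    step1=$(printf '%s\n' "$1" | sed -E "s/($word)/_\1_/g")
    # 2. Move a plural "s" back onto the word.
    step2=$(printf '%s\n' "$step1" | sed -E "s/($word)_s|^($word)(_*)/\1s\2_/g")
    # 3. Strip an underscore left at the start or end of the string.
    step3=$(printf '%s\n' "$step2" | sed -E 's/^_|_$//g')
    printf '1. %s\n2. %s\n3. %s\n' "$step1" "$step2" "$step3"
}

trace_passes chocolatecakes cake
trace_passes cakessurpassexpectation cake
```

Running it with other words from the array, such as `is`, shows where the intermediate results start to differ from what you want.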

As the steps show, the "is" entry in the array is the likely trouble spot: if it ever becomes "is_s" during operation 2, it will be rewritten in a way you probably don't want. Test multiple combinations of your strings to make sure every scenario you care about is covered. Once you've done that, you can refine the patterns as needed, or look for ways to use fewer piped commands.
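
One way to test many combinations is a small harness that compares a candidate pipeline against the expected results. Everything here is a sketch under assumptions: `transform` and `check` are hypothetical names, and the pattern inside `transform` is a stand-in that assumes lowercase ASCII input, not one of the commands from the answers. Swap in whichever pipeline you are refining.

```shell
#!/bin/sh
# Candidate pipeline under test (a hypothetical stand-in; replace with your own).
# The second expression demands two letters after the word, so a lone
# trailing "s" is never split off.
transform() {
    printf '%s\n' "$1" |
        sed -E 's/([a-z])(cakes?|is)/\1_\2/g; s/(cakes?|is)([a-z][a-z])/\1_\2/g'
}

# check INPUT EXPECTED: report whether transform produces the expected string.
check() {
    actual=$(transform "$1")
    if [ "$actual" = "$2" ]; then
        printf 'ok    %s -> %s\n' "$1" "$actual"
    else
        printf 'FAIL  %s -> %s (expected %s)\n' "$1" "$actual" "$2"
        return 1
    fi
}

# Cases from the question, plus one edge case: a bare plural stays intact.
check chocolatecakes chocolate_cakes
check thecakeismine the_cake_ismine
check cakessurpassexpectation cakes_surpassexpectation
check cakes cakes
```

Add a `check` line for each new string you are unsure about; a `FAIL` line shows exactly which input broke and how.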

Upvotes: 1
