Exactly two capitalized words on a line

Question

I want to create a regular expression which can replace lines that contain exactly two words beginning with an uppercase with the character 'X'.

I'm currently using this:

sed -e '/\b[A-Z][a-z]*\b c X /home/Morgan/desktop/test

The problem is the following: it only changes lines which contain 1 or more words described by the regular expression in my test.txt.

I don't know how to say that I want an X only on lines with exactly 2 words beginning with an uppercase letter. Either word can occur anywhere within the line.

My test.txt contains:

Bonjour oui oui Bonjour -> this must be replaced by X

Bonjour Bonjour Bonjour -> this mustn't

Bonjour Oui bonjour oui -> this must be replaced by X

tripleee · Accepted Answer

You seem to be attempting to use the Perl/PCRE word boundary \b, but typical sed implementations do not understand this regular expression dialect. By your problem description, you are looking for beginning and end of line anyway; these are very basic regex anchors which were already present in the original grep: ^ matches beginning of line, and $ matches end of line.

Without anchors, a regular expression will match anywhere in the line. To say "only two" you really must check the entire line and make sure there are not three or more of what you're looking for.
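As a small illustration of the difference anchors make, here is a sketch using grep -c on a made-up sample line (the line and the variable names are assumptions for the example, not from the question). LC_ALL=C is set so that [A-Z] and [a-z] are plain ASCII ranges.

```shell
# Hypothetical sample line: one capitalized word between two lowercase words.
line='oui Bonjour oui'

# Unanchored: matches, because a capitalized word occurs somewhere in the line.
unanchored=$(printf '%s\n' "$line" | LC_ALL=C grep -c '[A-Z][a-z]*' || true)

# Anchored at both ends: the whole line would have to be one capitalized word.
anchored=$(printf '%s\n' "$line" | LC_ALL=C grep -c '^[A-Z][a-z]*$' || true)

echo "unanchored=$unanchored anchored=$anchored"
# prints: unanchored=1 anchored=0
```

The "|| true" is only there because grep exits with a non-zero status when nothing matches.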

"Find a line with exactly two words which begin with uppercase" needs to be rephrased or massaged a bit before you can attempt to write a regex. If we -- provisionally, for this discussion -- define w to mean "word which does not begin with uppercase" and W to mean one which does, you want ^w*Ww*Ww*$ -- exactly two uppercase words, and zero or more non-uppercase words in any position before, between, or after them.

A word which begins with uppercase is [A-Z][a-z]* (this requires all the subsequent characters to be lowercase) and a word which doesn't is [a-z][a-z]* (or [a-z]\+ if your sed supports that regex variation).

Because words need spaces between them, an optional word expression needs to be parenthesized so you can say "zero or more of this entire sequence". Typically, sed regex requires grouping parentheses to be backslashed as well, though this differs between versions.

So, try this:

sed 's/^\([a-z][a-z]* \)*[A-Z][a-z]*\( [a-z][a-z]*\)* [A-Z][a-z]*\( [a-z][a-z]*\)*$/X/' file
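To check the basic-regex form with backslashed grouping parentheses, the three sample lines from the question (without the "-> ..." annotations) can be piped through sed. This is a sketch that assumes the C locale, so the bracket ranges cover only ASCII letters; the variable name result is just for the example.

```shell
# Feed the question's three sample lines to the basic-regex (BRE) command.
result=$(printf '%s\n' \
    'Bonjour oui oui Bonjour' \
    'Bonjour Bonjour Bonjour' \
    'Bonjour Oui bonjour oui' |
  LC_ALL=C sed 's/^\([a-z][a-z]* \)*[A-Z][a-z]*\( [a-z][a-z]*\)* [A-Z][a-z]*\( [a-z][a-z]*\)*$/X/')

printf '%s\n' "$result"
# prints:
# X
# Bonjour Bonjour Bonjour
# X
```

Only the lines with exactly two capitalized words are replaced; the line with three is left untouched.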

If indeed you have GNU sed, this can be simplified a bit:

sed -r 's/^([a-z]+ )*[A-Z][a-z]*( [a-z]+)* [A-Z][a-z]*( [a-z]+)*$/X/' file
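The extended-regex version can also be assembled from the W and w building blocks, which may make the ^w*Ww*Ww*$ shape easier to see. This is only a sketch: the shell variables W, w and pattern are names chosen for the example, the input lines are made up, and -E is used here as the spelling of the extended-regex option that both GNU and BSD sed accept (the answer's -r is the GNU spelling).

```shell
# W: a word beginning with uppercase; w: a word that does not.
W='[A-Z][a-z]*'
w='[a-z]+'

# ^w*Ww*Ww*$ with exactly one space between words.
pattern="^($w )*$W( $w)* $W( $w)*\$"

# Hypothetical inputs: two capitalized words, none, and one.
result=$(printf '%s\n' 'oui Bonjour Oui' 'oui oui' 'Oui' |
  LC_ALL=C sed -E "s/$pattern/X/")

printf '%s\n' "$result"
# prints:
# X
# oui oui
# Oui
```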

This definition of "word" might not be sufficient; perhaps you can refine it to suit your circumstances. In particular, the spacing is assumed to be regular (exactly one space between words; no leading or trailing whitespace on the lines) and no text may contain characters outside of spaces and the alphabetics a-z in upper or lower case. (Whether accented characters like è and Á are also considered alphabetics in this range depends on your locale settings. Maybe set LC_ALL=fr_FR.utf-8 in your script if French locale settings are important.)
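If the spacing is not regular, one possible refinement is to replace each literal space with a POSIX whitespace class and to allow optional whitespace at both ends of the line. The following is a sketch of that idea, not part of the original answer; the variable names and sample lines are assumptions, and the definition of "word" is otherwise unchanged.

```shell
# Same building blocks as before, plus a whitespace class.
W='[A-Z][a-z]*'
w='[a-z]+'
s='[[:space:]]'

# One or more whitespace characters between words, optional at line ends.
pattern="^$s*($w$s+)*$W($s+$w)*$s+$W($s+$w)*$s*\$"

# Hypothetical inputs with irregular spacing.
result=$(printf '%s\n' '  Bonjour   oui  Oui ' 'Bonjour  Bonjour  Bonjour' |
  LC_ALL=C sed -E "s/$pattern/X/")

printf '%s\n' "$result"
# prints:
# X
# Bonjour  Bonjour  Bonjour
```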

Notice also how the sed substitution command requires exactly three delimiter characters -- traditionally, we use a slash, but you can use any punctuation character. The form is s/regex/replacement/flags where the regex, the replacement, and the flags can all be empty, but the s and the delimiters are always required.
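Two small examples of the s command's shape, as a sketch (the variable names and the /tmp replacement are made up for the illustration): a different delimiter avoids having to escape slashes in a path, and an empty replacement simply deletes what the regex matched.

```shell
# Using | as the delimiter, so the slashes in the path need no escaping.
moved=$(printf '%s\n' '/home/Morgan/desktop/test' | sed 's|/home/Morgan|/tmp|')
echo "$moved"
# prints: /tmp/desktop/test

# Empty replacement with the g flag: delete every "oui " in the line.
stripped=$(printf '%s\n' 'Bonjour oui oui Bonjour' | sed 's/oui //g')
echo "$stripped"
# prints: Bonjour Bonjour
```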
