Reputation: 5927

How to grep for complete words between whitespaces?

I have a set of text files and a set of keywords that I need to find in those files. However, I am only interested in matching "complete words", that is, strings between whitespace. So for example, if I have the text

line1: word1 word2,
line2: word3 word22
line3: word4 aword2

I want to get only line1, but not line2 or line3, if I search for word2. Also, I need to know the line where the match occurred, so I can't turn each text file into a bag of words and search there.

Can I use grep for this? If so, how? Or are there better alternatives?

Also, will this work if I want to search for a phrase instead? For example,

line1: word1 word word2,
line2: word3 word word22
line3: word4 wword word2

should produce only line1 if I search for "word word2".

Upvotes: 1

Answers (3)

Todd Hoatson

Reputation: 364

Beware, users of pcre2grep! Using the -w option or \W in the regexp does not work well with accented characters. For example, using "(^|\W)class($|\W)" results in the following 2 lines also being matched:

"Verset déclassé",

"Segment de verset déclassé",

As you can see from this example, the accented e is not considered to be a word-forming character.

(NB: I am using pcre2grep 1022 - GNU grep 2.0d)
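If what you need really is "strings between whitespace", one possible workaround that does not depend on how the tool classifies accented letters is to spell out the delimiters as whitespace. A minimal sketch, using plain grep -E with the POSIX class [[:space:]] rather than pcre2grep; the sample lines and variable names are made up for illustration:

```shell
# Hypothetical sample input: one accented near-miss and two real hits.
input=$(printf '%s\n' '"Verset déclassé",' 'a class apart' 'class')

# Require line start/end or whitespace around the keyword, so the bytes of
# an accented letter can never count as a word boundary.
strict=$(printf '%s\n' "$input" | grep -E '(^|[[:space:]])class($|[[:space:]])')
printf '%s\n' "$strict"
```

Note the trade-off: with whitespace-only delimiters, a keyword followed directly by punctuation (such as "class,") is no longer matched.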

Upvotes: 0

kvantour

Reputation: 26561

This is what grep's -w option is for:

-w, --word-regexp: Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore.

(source: man grep)

$ grep -w word2 file
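Since the question also asks for the line where the match occurred, -w can be combined with -n. A small sketch, assuming a throwaway sample file that holds the three lines from the question (the variable names are only for illustration):

```shell
# Hypothetical sample file reproducing the three lines from the question.
sample=$(mktemp)
printf '%s\n' 'word1 word2,' 'word3 word22' 'word4 aword2' > "$sample"

# -w matches whole words only; -n prefixes each hit with its line number.
matches=$(grep -nw 'word2' "$sample")
printf '%s\n' "$matches"

rm -f "$sample"
```

This should print only `1:word1 word2,`, because word22 and aword2 have a word character directly next to the match.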

Upvotes: 3

EdmCoff

Reputation: 3596

I think you are looking for something like

grep -E "(^|\W)word2($|\W)" mytestfile.txt

The same thing should also work for your second scenario

grep -E "(^|\W)word word2($|\W)" mytestfile.txt

The -E is for extended-regexp (egrep). (^|\W) will match the beginning of a line or a non-word character (\W is equivalent to [^a-zA-Z0-9_]). ($|\W) will match the end of a line or a non-word character.

I tested this on OSX, but I think it will work generally on almost any system (GNU Grep has a -E option too).
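To check the phrase case, here is a rough sketch against the second sample from the question. It assumes a temporary file with those three lines, and it writes \W as the POSIX bracket expression [^[:alnum:]_], which should behave the same way on greps that do not support \W; -n is added to show the line number:

```shell
# Hypothetical sample file with the phrase example from the question.
sample=$(mktemp)
printf '%s\n' 'word1 word word2,' 'word3 word word22' 'word4 wword word2' > "$sample"

# The phrase must be bounded by line start/end or a non-word character.
hits=$(grep -nE '(^|[^[:alnum:]_])word word2($|[^[:alnum:]_])' "$sample")
printf '%s\n' "$hits"

rm -f "$sample"
```

Only the first line should be reported (`1:word1 word word2,`): in the second line the phrase is followed by a digit, and in the third it is preceded by the extra "w" of "wword".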

Upvotes: 1
