Reputation: 5927
I have a set of text files and a set of keywords that I need to find in those files. However, I am only interested in matching "complete words", that is, strings delimited by whitespace. So for example, if I have the text
line1: word1 word2,
line2: word3 word22
line3: word4 aword2
I want to get only line1, not line2 or line3, when I search for word2. I also need to know the line where the match occurred, so I can't turn each text file into a bag of words and search there.
Can I use grep for this? If so, how? Or are there better alternatives?
Also, will this work if I want to search for a phrase instead? For example,
line1: word1 word word2,
line2: word3 word word22
line3: word4 wword word2
should produce only line1 when I search for "word word2".
Upvotes: 1
Views: 708
Reputation: 364
Beware, users of pcre2grep! The -w option and \W in the regexp do not work well with accented characters. For example, using "(^|\W)class($|\W)" results in the following 2 lines also being matched:
"Verset déclassé",
"Segment de verset déclassé",
As you can see from this example, the accented e is not considered to be a word-forming character.
(NB: I am using pcre2grep 1022 - GNU grep 2.0d)
Upvotes: 0
Reputation: 26471
This is what you have grep and all its options for:
-w, --word-regexp: Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore.
Source: man grep
$ grep -w word2 file
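Since the question also asks for the line where the match occurred, here is a minimal sketch that combines -w with -n (print line numbers). The sample file and the variable names are made up for illustration; they just reproduce the lines from the question.

```shell
# Hypothetical sample files built from the question's two examples.
words=$(mktemp)
phrases=$(mktemp)
printf '%s\n' 'line1: word1 word2,' 'line2: word3 word22' 'line3: word4 aword2' > "$words"
printf '%s\n' 'line1: word1 word word2,' 'line2: word3 word word22' 'line3: word4 wword word2' > "$phrases"

# -w keeps only whole-word matches; -n prefixes each hit with its line number.
word_hits=$(grep -wn 'word2' "$words")
echo "$word_hits"

# The same options applied to a phrase: the match as a whole must sit on word boundaries.
phrase_hits=$(grep -wn 'word word2' "$phrases")
echo "$phrase_hits"

rm -f "$words" "$phrases"
```

Each search should report only the first line of its file, prefixed with "1:". Add -H (or pass several files) if you also want the file name in front of each hit.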
Upvotes: 3
Reputation: 3576
I think you are looking for something like
grep -E "(^|\W)word2($|\W)" mytestfile.txt
The same thing should also work for your second scenario
grep -E "(^|\W)word word2($|\W)" mytestfile.txt
The -E is for extended-regexp (egrep). (^|\W) will match the beginning of a line or a non-word character, that is, anything other than a letter, digit, or underscore ([^a-zA-Z0-9_]). ($|\W) will match the end of a line or a non-word character.
I tested this on OSX, but I think it will work generally on almost any system (GNU Grep has a -E option too).
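One caveat: \W is an extension rather than part of POSIX extended regular expressions. If you hit a grep that does not understand it, a portable spelling should be the bracket expression [^[:alnum:]_]. The sketch below assumes a made-up sample file with the phrase example from the question, and adds -n so the line number is printed too.

```shell
# Hypothetical sample file with the phrase example from the question.
sample=$(mktemp)
printf '%s\n' \
  'line1: word1 word word2,' \
  'line2: word3 word word22' \
  'line3: word4 wword word2' > "$sample"

# [^[:alnum:]_] is the POSIX bracket-expression spelling of \W;
# -n adds the line number of each matching line.
portable_hits=$(grep -En '(^|[^[:alnum:]_])word word2($|[^[:alnum:]_])' "$sample")
echo "$portable_hits"

rm -f "$sample"
```

Only line1 should come back, since line2 has a trailing 2 after the phrase and line3 has an extra w in front of it. Note that [[:alnum:]] depends on the current locale, so results with accented letters can differ between locales.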
Upvotes: 1