Stef Man

Reputation: 65

How can I match a line containing only a single instance of a pattern with grep?

Given a text file such as this, say phrases.txt with contents:

Hahahahahasdhfjshfjshdhfjhdf
Hahahaha!
jdsahjhshfjhfHahahaha!dhsjfhajhfjhf
Hahaha!Hahaha!
dfhjfsf
sdfjsjf Hahaha! djfhjsdfh
Ha! hdfshdfs
Ha! Ha! Ha!

What would be an appropriate grep command in bash that outputs only the lines containing exactly one occurrence of laughter, where laughter is defined as a string of the form Hahahahaha! with arbitrarily many repetitions of ha? The first H is always capital, the other ones are not, and the string must end in !. In my example, the command should output:

Hahahaha!
jdsahjhshfjhfHahahaha!dhsjfhajhfjhf
sdfjsjf Hahaha! djfhjsdfh
Ha! hdfshdfs

A command I came up with was:

egrep "(Ha(ha)*\!){1}" phrases.txt

The issue with my command is that it does not output only the lines with a single occurrence of laughter. Line 4 (Hahaha!Hahaha!) and line 8 (Ha! Ha! Ha!) of the file also get printed, which is not what I want.

Is there a nice way to do this with only grep?

Upvotes: 3

Views: 1976

Answers (2)

Wiktor Stribiżew

Reputation: 626738

If you have GNU grep with PCRE support (the -P option) or pcregrep, you may use

grep -P '^(?!(?:.*Ha(ha)*!){2}).*Ha(ha)*!'

The pattern is:

^(?!(?:.*YOUR_PATTERN_HERE){2}).*YOUR_PATTERN_HERE

where YOUR_PATTERN_HERE stands for the pattern you want to occur only once in the string.

Details

  • ^ - start of a string
  • (?!(?:.*YOUR_PATTERN_HERE){2}) - a negative lookahead that fails the match if, immediately to the right of the current location (here, the start of string), there are two consecutive occurrences of
    • .* - any 0+ chars other than line break chars
    • YOUR_PATTERN_HERE - your required pattern
  • .* - any 0+ chars other than line break chars
  • YOUR_PATTERN_HERE - your required pattern.
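
As a rough sketch of how the template generalizes, the pattern can be spliced into a small shell function. The name grep_exactly_once is made up for illustration, and the sketch assumes GNU grep built with PCRE support (-P). Each copy of the pattern is wrapped in a non-capturing group so that a top-level alternation such as a|b stays contained:

```shell
# Hypothetical helper (not part of grep): reads stdin and prints the lines
# that contain exactly one match of the PCRE passed as the first argument.
# Assumes GNU grep with -P available.
grep_exactly_once() {
    pat=$1
    # The fixed parts are single-quoted so an interactive bash does not try
    # history expansion on the "!" in "(?!".
    grep -P -- '^(?!(?:.*(?:'"$pat"')){2}).*(?:'"$pat"')'
}
```

For example, grep_exactly_once 'Ha(ha)*!' < phrases.txt should print the same lines as the grep -P command above.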

See the online demo:

s="Hahahahahasdhfjshfjshdhfjhdf
Hahahaha!
jdsahjhshfjhfHahahaha!dhsjfhajhfjhf
Hahaha!Hahaha!
dfhjfsf
sdfjsjf Hahaha! djfhjsdfh
Ha! hdfshdfs
Ha! Ha! Ha!"
echo "$s" | grep -P '^(?!(?:.*Ha(ha)*!){2}).*Ha(ha)*!'

Output:

Hahahaha!
jdsahjhshfjhfHahahaha!dhsjfhajhfjhf
sdfjsjf Hahaha! djfhjsdfh
Ha! hdfshdfs

Upvotes: 0

Algorithmic Canary

Reputation: 742

If you are okay with pipes, then:

egrep '(Ha(ha)*!)' yourfile.txt | egrep -v '(Ha(ha)*!).*(Ha(ha)*!)'

First filter for lines with at least one laugh, then filter out the ones that have more than one laugh.
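
To try the two-stage filter on the sample data from the question, something like the following should work. It uses grep -E, which is equivalent to egrep (recent GNU grep releases warn that egrep is obsolescent). The variable names and the temporary directory are only for illustration; the file is recreated there so nothing existing gets overwritten:

```shell
# Recreate phrases.txt from the question in a scratch directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/phrases.txt" <<'EOF'
Hahahahahasdhfjshfjshdhfjhdf
Hahahaha!
jdsahjhshfjhfHahahaha!dhsjfhajhfjhf
Hahaha!Hahaha!
dfhjfsf
sdfjsjf Hahaha! djfhjsdfh
Ha! hdfshdfs
Ha! Ha! Ha!
EOF

laugh='Ha(ha)*!'

# Stage 1 keeps lines with at least one laugh;
# stage 2 (-v) drops lines where a second laugh follows the first.
result=$(grep -E "$laugh" "$tmpdir/phrases.txt" | grep -Ev "$laugh.*$laugh")
printf '%s\n' "$result"
# Prints:
# Hahahaha!
# jdsahjhshfjhfHahahaha!dhsjfhajhfjhf
# sdfjsjf Hahaha! djfhjsdfh
# Ha! hdfshdfs

rm -r "$tmpdir"
```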

Note: {1} only repeats the previous chunk; it doesn't check the rest of the string to insist that there is only one. So a{1} and a are actually the same.
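
A quick way to convince yourself of this is to run both forms of the pattern over the same input and compare the results. The three sample lines and the variable names below are made up for the check:

```shell
sample=$(printf '%s\n' 'Ha!' 'Ha! Ha!' 'no laughter')

# Same pattern with and without the {1} quantifier.
with_quantifier=$(printf '%s\n' "$sample" | grep -E '(Ha(ha)*!){1}')
without_quantifier=$(printf '%s\n' "$sample" | grep -E 'Ha(ha)*!')

# Both select 'Ha!' and 'Ha! Ha!', so the comparison succeeds.
if [ "$with_quantifier" = "$without_quantifier" ]; then
    echo "identical output"
fi
```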

Upvotes: 2
