How to highlight all inside-a-line word duplicates?

Question

I have this plain text file :

1 : Foo Bar Bar Baz
2 : The dog The cat the hamster
3 : the dog the cat the hamster
4 : The dog the cat the hamster The Doors

and I want to highlight all the repeated words.

I have cobbled together this solution :

/\(\<\w\+\>\)\zs[ w]\+\1

This works well with the first line (the second Bar is highlighted) but this does not solve the following problems :

line 2 : it highlights the second "The", but not the lowercase "the"

line 3 : it highlights only the first repetition of "the"

line 4 : it highlights only the second "The" (the first repeated word), but not the second "the" (the other repeated word)

PS : just for the sake of it, a side question : is it possible to highlight repeated words in neighbouring lines so as to see the repetition of the same word on a span of 3 lines ?

Thank you in advance

Ruud Helderman · Accepted Answer

This works slightly better:

/\(\<\w\+\>\).\{-}\zs\<\1\>

Changes:

Replaced [ w]\+ by .\{-} (match any character between the pair of words; non-greedy quantifier).
Moved \zs forward.
Surrounded \1 by \<...\>; these anchors are not automatically inherited from the capturing subpattern.

Unfortunately, there is one remaining problem; the third "the" on line 3 and the second "the" on line 4 are not matched due to overlap with another match. This can be resolved by using a look-behind pattern \@<= (with a possible performance penalty) instead of \zs:

/\%(\<\1\>.\+\)\@<=\(\<\w\+\>\)

Getting this to work was a bit of trial and error; it requires:

not to use capturing subpatterns in the look-behind pattern (i.e. use \%(...\) instead of \(...\))
not to use non-greedy quantifiers in the look-behind pattern (i.e. go back to OP's \+ instead of \{-})
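
The overlap problem that motivates the look-behind can be reproduced outside Vim. Here is a minimal sketch, assuming Python's re module as a stand-in: its \b and lazy .*? only approximate Vim's \<, \> and .\{-}, and the variable names are made up for illustration. It runs a consuming pattern over line 3 of the question and records where each reported duplicate starts:

```python
import re

line = "the dog the cat the hamster"

# Rough analogue of the consuming pattern: a match runs from a word
# up to and including its next occurrence on the same line.
consuming = re.compile(r"\b(\w+)\b.*?\b\1\b")

# Start offset of the duplicate at the end of each match.
found = [m.end() - len(m.group(1)) for m in consuming.finditer(line)]
print(found)
```

Only the second "the" (offset 8) is reported. The third one (offset 16) would need a match starting at the second "the", which the first match has already consumed.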

The following optional additions can be used in both patterns:

Add \c to the pattern for case-insensitivity.
Replace . by \_. to find word duplicates across line breaks.

Please note that with the second pattern (the look-behind one), a cross-line match has limited scope. Look-behind searches backward no further than one line. This is by design, to avoid serious performance issues. So duplicate words are recognized as such only when the two words are on the same line, or on lines immediately following each other.

Things become easier if we follow Qeole's example to match the first word(s), rather than the last one(s); we can use a look-ahead, which does not have most of the drawbacks involved with look-behind. Here it is, complete with case-insensitivity and cross-line matching:

/\<\(\w\+\)\>\(\_.\+\<\1\>\)\@=\c
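
To illustrate why the look-ahead avoids the overlap problem, here is a rough Python analogue (an assumption for demonstration, not Vim syntax): re.IGNORECASE plays the role of \c, re.DOTALL lets . cross line breaks like \_., and (?=...) corresponds to \@=. The sample text is made up from the words in the question.

```python
import re

text = "The dog the cat\nthe hamster The Doors"

# The look-ahead is zero-width, so each match consumes only the word itself
# and later words remain available as starting points for further matches.
first_words = re.compile(r"\b(\w+)\b(?=.+\b\1\b)", re.IGNORECASE | re.DOTALL)

# (start offset, matched word) for every word that is repeated later on.
hits = [(m.start(), m.group(1)) for m in first_words.finditer(text)]
print(hits)
```

Every occurrence of "the" except the last is reported, regardless of case and across the line break. The final "The" is skipped because nothing repeats it afterwards, which is the "first word(s) rather than last one(s)" behaviour described above.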

We can even combine the two patterns with \| to include both the first and the last word in the search results:

/\%(\<\1\>\_.\+\)\@<=\(\<\w\+\>\)\|\<\(\w\+\)\>\(\_.\+\<\1\>\)\@=\c

Important note from Peter Rincker: look-behind may not work in some versions of Vim 7.4 due to a bug in the NFA regex engine. You can force Vim to use the old backtracking engine and get the desired results by prepending \%#=1 to the pattern. See :h NFA for more information. Do not use \%#=1 in Vim 7.3 and older.

/\%#=1\%(\<\1\>\_.\+\)\@<=\(\<\w\+\>\)\|\<\(\w\+\)\>\(\_.\+\<\1\>\)\@=\c
