Reputation: 2401
I have this plain text file:
1 : Foo Bar Bar Baz
2 : The dog The cat the hamster
3 : the dog the cat the hamster
4 : The dog the cat the hamster The Doors
and I want to highlight all the repeated words.
I have cobbled together this solution:
/\(\<\w\+\>\)\zs[ w]\+\1
This works well with the first line (the second Bar is highlighted), but it does not solve the following problems:
line 2: it highlights the second "The", but not the lowercase "the"
line 3: it highlights only the first repetition of "the"
line 4: it highlights only the second "The" (the first repeated word), but not the second "the" (the other repeated word)
PS: as a side question, is it possible to highlight repeated words across neighbouring lines, so as to see repetitions of the same word within a span of 3 lines?
Thank you in advance
Upvotes: 2
Views: 130
Reputation: 11018
This works slightly better:
/\(\<\w\+\>\).\{-}\zs\<\1\>
Changes:
- Replaced [ w]\+ by .\{-} (match any character between the pair of words; non-greedy quantifier).
- Moved \zs forward.
- Replaced \1 by \<...\>; these anchors are not automatically inherited from the capturing subpattern.
Unfortunately, there is one remaining problem: the third "the" on line 3 and the second "the" on line 4 are not matched due to overlap with another match. This can be resolved by using a look-behind pattern \@<= (with a possible performance penalty) instead of \zs:
/\%(\<\1\>.\+\)\@<=\(\<\w\+\>\)
Getting this to work was a bit of trial and error; it requires:
- \%(...\) instead of \(...\)
- \+ instead of \{-}
The following optional additions can be used in both patterns:
- Add \c to the pattern for case-insensitivity.
- Replace . by \_. to find word duplicates across line breaks.
Please note that with the second pattern, a cross-line match has limited scope. Look-behind searches backward no further than one line. This is by design, to avoid serious performance issues. So duplicate words are recognized as such only when the two words are on the same line, or on lines immediately following each other.
Things become easier if we follow Qeole's example to match the first word(s), rather than the last one(s); we can use a look-ahead, which does not have most of the drawbacks involved with look-behind. Here it is, complete with case-insensitivity and cross-line matching:
/\<\(\w\+\)\>\(\_.\+\<\1\>\)\@=\c
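A search highlight disappears as soon as you search for something else. To keep the repeated words highlighted, the same look-ahead pattern could be passed to matchadd(). This is only a sketch: the variable name w:dup_match is made up for illustration, and the open-ended \_.\+ look-ahead may be slow on large buffers.

```vim
" Highlight every word that occurs again later in the buffer, using the Search group.
" Single quotes keep the backslashes literal. \c is kept in the pattern because
" matchadd() does not use the 'ignorecase' option.
let w:dup_match = matchadd('Search', '\<\(\w\+\)\>\(\_.\+\<\1\>\)\@=\c')

" Remove the highlight again when it is no longer wanted.
call matchdelete(w:dup_match)
```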
We can even combine the two patterns with \| to include both the first and the last word in the search results:
/\%(\<\1\>\_.\+\)\@<=\(\<\w\+\>\)\|\<\(\w\+\)\>\(\_.\+\<\1\>\)\@=\c
Important note from Peter Rincker: look-behind may not work in some versions of Vim 7.4 due to a bug in the NFA regex engine. You can force Vim to use the old backtracking engine and get the desired results by prepending \%#=1 to the pattern. See :h NFA for more information. Do not use this in Vim 7.3 and older.
/\%#=1\%(\<\1\>\_.\+\)\@<=\(\<\w\+\>\)\|\<\(\w\+\)\>\(\_.\+\<\1\>\)\@=\c
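Instead of prefixing each pattern with \%#=1, the engine can also be selected for all patterns through the 'regexpengine' option. A minimal sketch, assuming a Vim version that has this option (7.4 or later):

```vim
" 0 = automatic selection, 1 = old backtracking engine, 2 = NFA engine
:set regexpengine=1

" Restore automatic engine selection afterwards.
:set regexpengine=0
```

The per-pattern prefix is the less intrusive choice, since changing the option affects every search.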
Upvotes: 3
Reputation: 9154
Disclaimer: This is an approximate answer, gathering what I wrote in comments for @Ruud's answer and OP's question.
First, here is a quick workaround to fix case problems (The vs. the):
:set ignorecase
Ruud's proposal to use \c is probably better, though.
Then here is a proposal to highlight occurrences of a repeated word, including the first appearance of the word on the line, but excluding the last one:
/\(\<\w\+\>\)\ze.\{-}\1
It's pretty similar to Ruud's solution (though I arrived at it on my own). If anyone has a fix for the last occurrence on the line, I'm interested in learning it.
As Ruud also pointed out, one solution might be to use Vim's \@<= look-behind. There's an interesting example in the documentation for this \@<= regex atom:
/\1\@<=,\([a-z]\+\)
should match ,abc in abc,abc, but I couldn't make this example work (it seems to me that \1 remains empty).
I have no idea about the side question, sorry.
Upvotes: 1