Find repeated words, first occurrence only

Question

Let's say I have the following block of text:

Hi, here
is some text. 
This is some Hi here more
And some.

I would like to highlight words that occur more than once, like this:

(?=\b([a-z]{2,})\b.*\b\1\b)\1

However, I only want it to highlight the first occurrence; in other words, the word shouldn't have a match before it (the second some shouldn't show up). The only way I can think of to do this is with a negative lookbehind, but I'm using Python's re module, which doesn't allow variable-length lookbehinds. How could this be done?

And yes, of course I can do something like the following:

>>> from collections import Counter
>>> Counter('''Hi, here
... is some text. 
... This is some Hi here more
... And some.'''.split())
Counter({'some': 2, 'is': 2, 'here': 2, 'And': 1, 'This': 1, 'text.': 1, 'some.': 1, 'Hi': 1, 'Hi,': 1, 'more': 1})

But I'm curious if it's possible to do this with a regex.
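For reference, here is a minimal sketch that reproduces the two observations in the question using only the standard library. It assumes the pattern was run with re.DOTALL so that .* can cross line breaks (the question does not say which flags were used), and the throwaway lookbehind pattern is purely illustrative.

```python
import re

text = """Hi, here
is some text.
This is some Hi here more
And some."""

# The pattern from the question. re.DOTALL is an assumption: without it,
# ".*" stops at line breaks and nothing in this sample repeats on one line.
pattern = re.compile(r"(?=\b([a-z]{2,})\b.*\b\1\b)\1", re.DOTALL)
matches = pattern.findall(text)
print(matches)  # "some" is reported twice, which is the unwanted part

# The stdlib re module rejects a variable-width lookbehind at compile time.
# This pattern is illustrative only; it is not a proposed solution.
try:
    re.compile(r"(?<!some.*)some")
    lookbehind_ok = True
except re.error as exc:
    lookbehind_ok = False
    print("re rejected the pattern:", exc)
```

Every occurrence that has a later repeat matches, so a word occurring three times shows up twice in the result.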

Wiktor Stribiżew · Accepted Answer

This is a task that is always better done with a combination of regex and code:

import re
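text = '''Hi, here
is some text. 
This is some Hi here more
And some.'''
# Match every word that occurs again later, then de-duplicate with set()
print( list(set(re.findall(r'\b([a-z]{2,})\b(?=.*\b\1\b)', text, re.DOTALL))) )
# => ['here', 'some', 'is'] (set order is arbitrary)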

See this Python demo.
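Since set() discards both order and position, a variant that keeps the match objects may be more useful for actual highlighting. The following is only a sketch of the "regex plus code" idea: it reuses the pattern above, while the helper names first_repeats and highlight and the bracket markers are hypothetical choices made for illustration.

```python
import re

# Same pattern as above: a word that occurs again somewhere later in the text.
PATTERN = re.compile(r"\b([a-z]{2,})\b(?=.*\b\1\b)", re.DOTALL)

def first_repeats(text):
    """Return match objects for the first occurrence of each repeated word."""
    seen = set()
    firsts = []
    for m in PATTERN.finditer(text):
        word = m.group(1)
        if word not in seen:  # later occurrences of the same word are skipped
            seen.add(word)
            firsts.append(m)
    return firsts

def highlight(text, left="[", right="]"):
    """Wrap each first occurrence in markers (hypothetical output format)."""
    pieces, pos = [], 0
    for m in first_repeats(text):
        pieces.append(text[pos:m.start()])
        pieces.append(f"{left}{m.group(1)}{right}")
        pos = m.end()  # the lookahead is zero-width, so this is the word's end
    pieces.append(text[pos:])
    return "".join(pieces)

text = """Hi, here
is some text.
This is some Hi here more
And some."""

print([m.group(1) for m in first_repeats(text)])  # ['here', 'is', 'some']
print(highlight(text))
```

Because the matches come back in document order, the first match seen for each word is necessarily its first occurrence, so no lookbehind is needed.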

If your task strictly requires a single regex operation, you need to install the PyPI regex library (type pip install regex or pip3 install regex in the terminal and hit ENTER) and use

import regex
text = r'''Hi, here
is some text. 
This is some Hi here more
And some.'''
print( regex.findall(r'\b([a-z]{2,})\b(?<!^.*\b\1\b.*\b\1\b)(?=.*\b\1\b)', text, regex.DOTALL) )
# => ['here', 'is', 'some']

See this Python demo and this ECMAScript regex demo. The (?<!^.*\b\1\b.*\b\1\b) lookbehind fails the match if the word captured into Group 1 appears anywhere before this match.


Note that your regex does not need to account for overlapping matches, since it only matches whole words composed of two or more lowercase ASCII letters; hence, I removed the capturing group and the outer lookahead.
