samuelbrody1249

Reputation: 4767

Find repeated words, first occurrence only

Let's say I have the following block of text:

Hi, here
is some text. 
This is some Hi here more
And some.

I would like to highlight the words that appear more than once, like this:

[image: the text with the repeated words highlighted]

However, I only want it to highlight the first occurrence of each repeated word. In other words, a highlighted word shouldn't have an earlier match before it (the second "some" shouldn't show up). The only way I can think of to do this is with a negative lookbehind, but I'm using Python's re module, which doesn't allow variable-length lookbehinds. How could this be done?


And yes, of course I can do something like the following:

>>> from collections import Counter;Counter('Hi, here\nis some text. \nThis is some Hi here more\nAnd some.'.split())
Counter({'some': 2, 'is': 2, 'here': 2, 'And': 1, 'This': 1, 'text.': 1, 'some.': 1, 'Hi': 1, 'Hi,': 1, 'more': 1})

But I'm curious if it's possible to do this with a regex.

Upvotes: 3

Views: 114

Answers (1)

Wiktor Stribiżew

Reputation: 626853

This is a task that is always better done with a combination of regex and code:

import re
text = 'Hi, here\nis some text. \nThis is some Hi here more\nAnd some.'
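# Find every lowercase word that occurs again later, then deduplicate with a set
print(list(set(re.findall(r'\b([a-z]{2,})\b(?=.*\b\1\b)', text, re.DOTALL))))
# => ['here', 'some', 'is'] (set order is arbitrary, so the order may differ)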

See this Python demo.
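The list(set(...)) call above discards the order in which the words appear. If that order matters, one possible variation (a minimal sketch, assuming Python 3.7+ where dicts keep insertion order; the names pattern and repeated are only illustrative) is to deduplicate with dict.fromkeys instead:

```python
import re

text = 'Hi, here\nis some text. \nThis is some Hi here more\nAnd some.'

# Same pattern as above: a lowercase word that occurs again later in the text.
pattern = re.compile(r'\b([a-z]{2,})\b(?=.*\b\1\b)', re.DOTALL)

# dict.fromkeys drops duplicates while keeping first-seen order (Python 3.7+).
repeated = list(dict.fromkeys(pattern.findall(text)))
print(repeated)  # ['here', 'is', 'some']
```

Note that "Hi" is not reported because the pattern only matches lowercase ASCII words.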

If you have a very specific task that must be done with a single regex operation, you need to install the PyPI regex library (type pip install regex or pip3 install regex in the terminal and hit ENTER) and use

import regex
text = r'''Hi, here
is some text. 
This is some Hi here more
And some.'''
# The lookbehind rejects a word that already appeared earlier in the text
print(regex.findall(r'\b([a-z]{2,})\b(?<!\b\1\b.*\1)(?=.*\b\1\b)', text, regex.DOTALL))
# => ['here', 'is', 'some']

See this Python demo and this ECMAScript regex demo. The (?<!\b\1\b.*\1) lookbehind fails the match if the word captured into Group 1 appears anywhere before this match.
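If installing the regex library is not an option, the effect of that lookbehind can be approximated with the standard re module by remembering which words have already been matched. The following is only a sketch under that assumption; first_repeated_spans is a hypothetical helper name, and it returns character offsets so the first occurrences can be highlighted:

```python
import re

text = 'Hi, here\nis some text. \nThis is some Hi here more\nAnd some.'

# A lowercase word of 2+ letters that occurs again later in the text.
pattern = re.compile(r'\b([a-z]{2,})\b(?=.*\b\1\b)', re.DOTALL)

def first_repeated_spans(s):
    """Return (word, start, end) for the first occurrence of each repeated word."""
    seen = set()
    spans = []
    for m in pattern.finditer(s):
        word = m.group(1)
        if word not in seen:  # stands in for the negative lookbehind
            seen.add(word)
            spans.append((word, m.start(1), m.end(1)))
    return spans

print(first_repeated_spans(text))
# [('here', 4, 8), ('is', 9, 11), ('some', 12, 16)]
```

The seen set does the job of the lookbehind: any earlier whole-word occurrence would itself have matched the pattern, so skipping words already seen keeps only the first one.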

Note that your regex does not need to account for overlapping matches, since it only matches whole words composed of two or more lowercase ASCII letters. Hence, I removed the capturing group and the outer lookahead.

Upvotes: 1
