Reputation: 4767
Let's say I have the following block of text:
Hi, here
is some text.
This is some Hi here more
And some.
I would like to highlight the words that occur multiple times in the text.
However, I only want it to highlight the first match; in other words, the word shouldn't have a match before it (the second some shouldn't show up). The only way I can think of to do this is a negative lookbehind, but I'm using Python's built-in re module, which doesn't allow variable-length lookbehinds. How could this be done?
And yes, of course I can do something like the following:
>>> from collections import Counter;Counter('Hi, here\nis some text. \nThis is some Hi here more\nAnd some.'.split())
Counter({'some': 2, 'is': 2, 'here': 2, 'And': 1, 'This': 1, 'text.': 1, 'some.': 1, 'Hi': 1, 'Hi,': 1, 'more': 1})
But I'm curious if it's possible to do this with a regex.
Upvotes: 3
Views: 114
Reputation: 626853
This is a task that is always better done with a combination of regex and code:
import re
text = 'Hi, here\nis some text. \nThis is some Hi here more\nAnd some.'
print( list(set(re.findall(r'\b([a-z]{2,})\b(?=.*\b\1\b)', text, re.DOTALL))) )
# => ['here', 'some', 'is'] (set order is arbitrary and may vary between runs)
See this Python demo.
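If the goal is highlighting rather than just listing the words, the positions of the first occurrences are needed too. Here is a minimal sketch of the "regex plus code" idea using only the standard re module. The function name first_repeated_spans and the simplified pattern are my own choices for illustration, not part of the snippet above; the sketch assumes the same definition of a word (two or more lowercase ASCII letters).

```python
import re

def first_repeated_spans(text, pattern=r'\b[a-z]{2,}\b'):
    """Return (start, end, word) for the first occurrence of every
    word that occurs more than once in text."""
    matches = list(re.finditer(pattern, text))

    # Count how often each word occurs.
    counts = {}
    for m in matches:
        counts[m.group()] = counts.get(m.group(), 0) + 1

    # Keep only the first occurrence of each repeated word.
    seen = set()
    spans = []
    for m in matches:
        word = m.group()
        if counts[word] > 1 and word not in seen:
            seen.add(word)
            spans.append((m.start(), m.end(), word))
    return spans

text = 'Hi, here\nis some text. \nThis is some Hi here more\nAnd some.'
spans = first_repeated_spans(text)
print(spans)
```

Each tuple gives the slice to highlight, so text[start:end] is the word itself, and the results come back in document order.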
If you have some very specific task only involving a single regex operation, you need to install the PyPI regex library (type pip install regex or pip3 install regex in the terminal and hit ENTER) and use
import regex
text = r'''Hi, here
is some text.
This is some Hi here more
And some.'''
print( regex.findall(r'\b([a-z]{2,})\b(?<!\b\1\b.*\1)(?=.*\b\1\b)', text, regex.DOTALL) )
# => ['here', 'is', 'some']
See this Python demo and this ECMAScript regex demo. The (?<!\b\1\b.*\1) lookbehind fails the match if the word captured into Group 1 appears anywhere before this match.
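To make the two lookaround conditions concrete, here is a rough emulation with the standard re module only: the "appears after" check plays the role of the lookahead and the "appears before" check plays the role of the negative lookbehind. This is only a sketch of the logic, not how the regex library evaluates the pattern, and the helper name first_of_repeated is hypothetical.

```python
import re

def first_of_repeated(text):
    """List words that occur again later in text but never earlier."""
    results = []
    for m in re.finditer(r'\b[a-z]{2,}\b', text):
        word = re.compile(r'\b' + re.escape(m.group()) + r'\b')
        # Emulates the lookahead (?=.*\b\1\b): the word must recur later.
        appears_after = word.search(text[m.end():]) is not None
        # Emulates the lookbehind (?<!\b\1\b.*\1): the word must not occur earlier.
        appears_before = word.search(text[:m.start()]) is not None
        if appears_after and not appears_before:
            results.append(m.group())
    return results

text = 'Hi, here\nis some text. \nThis is some Hi here more\nAnd some.'
words = first_of_repeated(text)
print(words)
```

Slicing the text for every candidate is quadratic in the worst case, which is fine for short inputs like this one.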
Note that your regex cannot produce overlapping matches, since it only matches whole words composed of two or more lowercase ASCII letters; hence, I removed the capturing group and the outer lookahead.
Upvotes: 1