Reputation: 3086
Disclaimer: I found quite a lot of similar questions, but not this specific one. Once someone has answered, I will delete it.
I need to find all masked words such as:
AAAAA likes apples, but BBBBB likes bananas. Their phone numbers are ffffr and ggggh.
The pattern is a single character repeated at least three times in a row.
When I use:
import re
p = re.compile(r'[a-z]{3,}', re.IGNORECASE)
m = p.findall('AAAAA likes apples, but BBBBB likes bananas. Their phone numbers are ffffr and ggggh.')
I simply get every word that contains three or more letters.
Ideally, I should get only:
m = ['AAAAA', 'BBBBB', 'ffffr', 'ggggh']
How should I change the regex to capture only those?
Thanks!
Upvotes: 1
Views: 1432
Reputation: 18490
Your current regex only checks for three or more [a-z] characters in a row, not for a repeated one. To check whether a letter is repeated, you need to capture it and backreference it later. Using your re.IGNORECASE flag:
\b\w*?([a-z])\1\1\w*\b
\b matches a word boundary
\w matches a word character
([a-z]) captures an alphabetic character to \1
\1 is a backreference to what's captured by the first group
This matches at least 3 repeated [a-z] characters surrounded by any number of \w word characters.
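A minimal sketch of how this pattern might be used from Python. One detail to watch: because the pattern contains a capturing group, re.findall returns only the captured letter, so the sketch collects the whole match with re.finditer and group(0) instead. The variable names are illustrative only.

```python
import re

text = ('AAAAA likes apples, but BBBBB likes bananas. '
        'Their phone numbers are ffffr and ggggh.')

# ([a-z])\1\1 matches one letter followed by two more copies of it.
pattern = re.compile(r'\b\w*?([a-z])\1\1\w*\b', re.IGNORECASE)

# findall would return only group 1 (the repeated letter), so take
# the full match from each match object instead.
masked = [m.group(0) for m in pattern.finditer(text)]
print(masked)  # ['AAAAA', 'BBBBB', 'ffffr', 'ggggh']
```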
Upvotes: 3
Reputation: 36370
You can use a regular expression, but I suggest another way, namely:
txt = 'AAAAA likes apples, but BBBBB likes bananas. Their phone numbers are ffffr and ggggh.'
words = txt.split(' ')
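# len(i) >= 3 skips one- and two-letter words such as 'a', whose set would also have size 1
found = [i for i in words if len(i) >= 3 and len(set(i[:3].lower())) == 1]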
print(found) # ['AAAAA', 'BBBBB', 'ffffr', 'ggggh.']
Note that found is not exactly the same as your desired output, because of the . in the last element, but we can easily remove any trailing punctuation as follows:
import string
clear_found = [i.rstrip(string.punctuation) for i in found]
print(clear_found) # ['AAAAA', 'BBBBB', 'ffffr', 'ggggh']
Explanation of my method: I take the first 3 characters of each word, turn them all lowercase, then use set to check whether only one distinct letter (character) remains. Alternatively, you could use the .upper method of str. Feel free to use a regex-based solution if you consider it better suited to your use case, but please keep in mind that a non-regex solution is possible for certain problems.
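The check above looks only at the first three characters of each word. If the repeated run may appear anywhere in a word, one possible non-regex variation is to measure runs of identical letters with itertools.groupby. This is a sketch rather than part of the answer above, and has_run is a hypothetical helper name.

```python
import string
from itertools import groupby

def has_run(word, min_len=3):
    """Return True if word contains min_len or more identical letters in a row."""
    # groupby yields one group per run of equal consecutive characters.
    return any(
        key.isalpha() and sum(1 for _ in run) >= min_len
        for key, run in groupby(word.lower())
    )

txt = 'AAAAA likes apples, but BBBBB likes bananas. Their phone numbers are ffffr and ggggh.'
# Strip surrounding punctuation first so 'ggggh.' becomes 'ggggh'.
words = [w.strip(string.punctuation) for w in txt.split()]
found = [w for w in words if has_run(w)]
print(found)  # ['AAAAA', 'BBBBB', 'ffffr', 'ggggh']
```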
Upvotes: 1
Reputation: 27723
Here, if we wish to capture a whole word, we can combine a word boundary with a backreference, in an expression similar to:
\b([a-z])\1\1\w*\b
([a-z])\1\1 requires the same letter three times at the start of the word, and \w* consumes the rest of the word.
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"\b([a-z])\1\1\w*\b"
test_str = "AAAAA likes apples, but BBBBB likes bananas. Their phone numbers are ffffr and ggggh."
matches = re.finditer(regex, test_str, re.IGNORECASE)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))
    # Report each capturing group; group numbers start at 1
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Upvotes: 1