Arnold Klein

Reputation: 3086

find words comprised of repeated characters in python

Disclaimer: I found quite a lot of similar questions, but not this specific one. Once someone has answered, I will delete it.

I need to find all masked words such as:

AAAAA likes apples, but BBBBB likes bananas. Their phone numbers are ffffr and ggggh.

The pattern is a single character repeated at least three times.

When I use:

import re

p = re.compile(r'[a-z]{3,}', re.IGNORECASE)
m = p.findall('AAAAA likes apples, but BBBBB likes bananas. Their phone numbers are ffffr and ggggh.')

I simply get every word that contains at least 3 letters.

Ideally, I should get only:

m = ['AAAAA', 'BBBBB', 'ffffr', 'ggggh']

How should I change the regex to capture only those?

Thanks!

Upvotes: 1

Views: 1432

Answers (3)

bobble bubble

Reputation: 18490

Your current regex only checks for three or more [a-z] characters, not for a repeated one. To check whether a letter is repeated, you need to capture it and backreference it later. Using your re.IGNORECASE flag:

\b\w*?([a-z])\1\1\w*\b
  • \b matches a word boundary
  • \w matches a word character
  • ([a-z]) captures an alphabetic character to \1
  • \1 is a backreference to what's captured by the first group

See demo at regex101

This matches any word that contains at least three consecutive identical [a-z] letters, surrounded by any number of \w word characters.
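As a rough sketch of how this pattern could be applied in Python (the variable names are only illustrative): because the pattern contains a capturing group, re.findall would return just the captured letter of each match, so the whole word is taken from finditer instead.

```python
import re

# Pattern from the answer above; re.IGNORECASE applies to both [a-z]
# and the backreference \1.
pattern = re.compile(r'\b\w*?([a-z])\1\1\w*\b', re.IGNORECASE)

text = ('AAAAA likes apples, but BBBBB likes bananas. '
        'Their phone numbers are ffffr and ggggh.')

# findall() returns only group 1 (the repeated letter) when the pattern
# has a single capturing group, so use group(0) from finditer() instead.
matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)  # ['AAAAA', 'BBBBB', 'ffffr', 'ggggh']
```

Calling pattern.findall(text) on the same input gives ['A', 'B', 'f', 'g'], which is why group(0) is used here.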

Upvotes: 3

Daweo

Reputation: 36370

You can use a regular expression, but I suggest another way, namely:

txt = 'AAAAA likes apples, but BBBBB likes bananas. Their phone numbers are ffffr and ggggh.'
words = txt.split(' ')
found = [i for i in words if len(set(i[:3].lower()))==1]
print(found) # ['AAAAA', 'BBBBB', 'ffffr', 'ggggh.']

Note that found is not exactly the same as your desired output because of the . in the last element, but we can easily remove any trailing punctuation as follows:

import string
clear_found = [i.rstrip(string.punctuation) for i in found]
print(clear_found) # ['AAAAA', 'BBBBB', 'ffffr', 'ggggh']

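Explanation of my method: I take the first 3 characters of each word, convert them to lowercase, then use set to check whether they consist of a single distinct character. Alternatively, you could use the .upper method of str. Note that this only inspects the start of each word, so a repeated run later in a word is not detected, and a word shorter than three characters passes if all of its characters are the same. Feel free to use a regex-based solution if you consider it better suited to your use case, but please keep in mind that a non-regex solution is possible for certain problems.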
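If the repeated run may appear anywhere in a word rather than only in its first three characters, one possible non-regex variant groups consecutive characters with itertools.groupby. This is only a sketch: the helper name has_run and its min_run parameter are made up for illustration and are not part of the answer above.

```python
import string
from itertools import groupby


def has_run(word, min_run=3):
    """Return True if word contains min_run or more identical letters in a row.

    Hypothetical helper for illustration; the comparison is case-insensitive
    and runs of non-letters (digits, punctuation) are ignored.
    """
    return any(
        key.isalpha() and sum(1 for _ in group) >= min_run
        for key, group in groupby(word.lower())
    )


txt = ('AAAAA likes apples, but BBBBB likes bananas. '
       'Their phone numbers are ffffr and ggggh.')

# Strip surrounding punctuation from each whitespace-separated token first.
words = [w.strip(string.punctuation) for w in txt.split()]
found = [w for w in words if has_run(w)]
print(found)  # ['AAAAA', 'BBBBB', 'ffffr', 'ggggh']
```

Unlike the prefix check, this also accepts a word such as 'xaaay' and rejects single-letter words.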

Upvotes: 1

Emma

Reputation: 27723

Here, to capture a whole word, we can combine word boundaries with a backreference, using an expression similar to:

\b([a-z])\1\1\1.+?\b

Demo

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"\b([a-z])\1\1\1.+?\b"

test_str = "AAAAA likes apples, but BBBBB likes bananas. Their phone numbers are ffffr and ggggh."

matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)

for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))

    # Report each capture group (group numbers start at 1).
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
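One thing to keep in mind is that the expression above needs four identical letters at the start of a word plus at least one further character. If three identical leading letters should already count, a possible variant (an assumption on my side, not the expression from the demo) puts a {2,} quantifier on the backreference:

```python
import re

text = ('AAAAA likes apples, but BBBBB likes bananas. '
        'Their phone numbers are ffffr and ggggh.')

# Hypothetical variant: one captured letter, the same letter at least twice
# more, then the rest of the word.
variant = re.compile(r'\b([a-z])\1{2,}\w*\b', re.IGNORECASE)

variant_matches = [m.group(0) for m in variant.finditer(text)]
print(variant_matches)  # ['AAAAA', 'BBBBB', 'ffffr', 'ggggh']

# A word made of exactly three identical letters is matched as well.
short_matches = [m.group(0) for m in variant.finditer('AAA likes apples')]
print(short_matches)  # ['AAA']
```

This variant still only looks at the beginning of each word, so a run in the middle of a word is not reported.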

RegEx Circuit

jex.im visualizes regular expressions.

[Image: jex.im visualization of the expression above]

Upvotes: 1
