Remi Guan

Reputation: 22292

Find "one letter that appears twice" in a string

I'm trying to catch the same letter appearing twice in a row in a string using RegEx (or maybe there's a better way?). For example, my string is:

ugknbfddgicrmopn

The output would be:

dd

However, I've tried something like:

re.findall('[a-z]{2}', 'ugknbfddgicrmopn')

but in this case, it returns:

['ug', 'kn', 'bf', 'dd', 'gi', 'cr', 'mo', 'pn']   # the expected output is `['dd']`

I also have a way to get the expected output:

>>> l = []
>>> tmp = None
>>> for i in 'ugknbfddgicrmopn':
...     if tmp != i:
...         tmp = i
...         continue
...     l.append(i*2)
...
>>> l
['dd']

But that's too complex...


If it's 'abbbcppq', then only catch:

abbbcppq
 ^^  ^^

So the output is:

['bb', 'pp']

Then, if it's 'abbbbcppq', catch bb twice:

abbbbcppq
 ^^^^ ^^

So the output is:

['bb', 'bb', 'pp']

Upvotes: 58

Views: 14831

Answers (8)

Kasravnd

Reputation: 107347

A Pythonic way is to use the zip() function within a list comprehension:

>>> s = 'abbbcppq'
>>>
>>> [i+j for i,j in zip(s,s[1:]) if i==j]
['bb', 'bb', 'pp']

If you are dealing with a large string, you can use the iter() function to convert the string to an iterator and itertools.tee() to create two independent iterators. Calling next() on the second iterator consumes its first item; then pass both iterators to zip() (in Python 2.x use itertools.izip(), which returns an iterator).

>>> from itertools import tee
>>> first = iter(s)
>>> second, first = tee(first)
>>> next(second)
'a'
>>> [i+j for i,j in zip(first,second) if i==j]
['bb', 'bb', 'pp']

Benchmark with RegEx recipe:

# ZIP
~ $ python -m timeit --setup "s='abbbcppq'" "[i+j for i,j in zip(s,s[1:]) if i==j]"
1000000 loops, best of 3: 1.56 usec per loop

# REGEX
~ $ python -m timeit --setup "s='abbbcppq';import re" "[i[0] for i in re.findall(r'(([a-z])\2)', 'abbbbcppq')]"
100000 loops, best of 3: 3.21 usec per loop

After your last edit, as mentioned in the comments: if you want to match only one pair of b in strings like "abbbcppq", you can use finditer(), which returns an iterator of match objects, and extract the result with the group() method:

>>> import re
>>> 
>>> s = "abbbcppq"
>>> [item.group(0) for item in re.finditer(r'([a-z])\1',s,re.I)]
['bb', 'pp']

Note that re.I is the IGNORECASE flag which makes the RegEx match the uppercase letters too.
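
To check that the finditer() version gives the non-overlapping pairs asked for in the question, here is a small sketch that runs it on the three sample strings from the question. The names PAIR and find_pairs are hypothetical; only re.finditer() and the pattern come from the answer.

```python
import re

# Same pattern as above: one letter followed by a backreference to itself.
PAIR = re.compile(r'([a-z])\1', re.I)


def find_pairs(s):
    """Return non-overlapping doubled letters, scanning left to right."""
    return [m.group(0) for m in PAIR.finditer(s)]


for sample in ('ugknbfddgicrmopn', 'abbbcppq', 'abbbbcppq'):
    print(sample, find_pairs(sample))
# ugknbfddgicrmopn ['dd']
# abbbcppq ['bb', 'pp']
# abbbbcppq ['bb', 'bb', 'pp']
```

Each match consumes both characters, which is why 'bbb' gives one pair and 'bbbb' gives two.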

Upvotes: 32

Dima Tisnek

Reputation: 11779

It is pretty easy without regular expressions:

In [3]: import collections

In [4]: [k for k, v in collections.Counter("abracadabra").items() if v==2]
Out[4]: ['b', 'r']

Upvotes: 5

Avinash Raj

Reputation: 174796

You need to use a regex based on a capturing group and define it as a raw string.

>>> re.search(r'([a-z])\1', 'ugknbfddgicrmopn').group()
'dd'
>>> [i+i for i in re.findall(r'([a-z])\1', 'abbbbcppq')]
['bb', 'bb', 'pp']

or

>>> [i[0] for i in re.findall(r'(([a-z])\2)', 'abbbbcppq')]
['bb', 'bb', 'pp']

Note that re.findall here returns a list of tuples, with the text matched by the first group as the first element and the text matched by the second group as the second. In our case the first group is enough, hence i[0].
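
To see why i[0] is needed, it may help to print what re.findall() returns before the list comprehension picks the first element. A small illustration using the same pattern and input as above; the variable names raw and pairs are just for this example:

```python
import re

# With two capturing groups, findall() returns one tuple per match:
# (outer group = the whole pair, inner group = the single letter).
raw = re.findall(r'(([a-z])\2)', 'abbbbcppq')
print(raw)    # [('bb', 'b'), ('bb', 'b'), ('pp', 'p')]

# Keeping only the outer group gives the doubled letters.
pairs = [outer for outer, inner in raw]
print(pairs)  # ['bb', 'bb', 'pp']
```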

Upvotes: 51

Mark White

Reputation: 660

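A1 = "abcdededdssffffccfxx"

# Print each doubled letter once, at the start of its run.
for i in range(len(A1) - 1):
    if A1[i + 1] == A1[i]:
        # Skip positions inside a longer run. i == 0 is checked explicitly
        # so that A1[i - 1] does not wrap around to the last character.
        if i == 0 or A1[i] != A1[i - 1]:
            print(A1[i] * 2)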

Upvotes: 2

Lavi Avigdor

Reputation: 4182

"or maybe there's some better ways"

Since regex is often misunderstood by the next developer to encounter your code (who may even be you), and since simpler != shorter, how about the following pseudo-code:

function findMultipleLetters(inputString) {        
    foreach (letter in inputString) {
        dictionaryOfLettersOccurrence[letter]++;
        if (dictionaryOfLettersOccurrence[letter] == 2) {
            multipleLetters.add(letter);
        }
    }
    return multipleLetters;
}
multipleLetters = findMultipleLetters("ugknbfddgicrmopn");
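
One possible Python rendering of the pseudo-code above, as a sketch only: the names find_multiple_letters, counts and multiple are invented here. Like the pseudo-code, it reports letters that occur at least twice anywhere in the string, not only letters that are adjacent.

```python
from collections import defaultdict


def find_multiple_letters(input_string):
    """Return each letter at the moment it is seen for the second time."""
    counts = defaultdict(int)  # letter -> number of occurrences so far
    multiple = []
    for letter in input_string:
        counts[letter] += 1
        if counts[letter] == 2:  # record a letter only once, on its second sighting
            multiple.append(letter)
    return multiple


print(find_multiple_letters("ugknbfddgicrmopn"))  # ['d', 'g', 'n']
```

On the sample string this also returns 'g' and 'n', which repeat but are never next to each other, so it answers a looser question than the adjacent-pair one.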

Upvotes: 3

xhg

Reputation: 1875

Maybe you can use a generator to achieve this:

def adj(s):
    last_c = None
    for c in s:
        if c == last_c:
            yield c * 2
        last_c = c

s = 'ugknbfddgicrmopn'
v = list(adj(s))
print(v)
# output: ['dd']

Upvotes: 4

Mayur Koshti

Reputation: 1862

>>> l = ['ug', 'kn', 'bf', 'dd', 'gi', 'cr', 'mo', 'pn']
>>> import re
>>> newList = [item for item in l if re.search(r"([a-z]{1})\1", item)]
>>> newList
['dd']

Upvotes: 0

Gurupad Hegde

Reputation: 2155

Using a backreference, it is very easy:

import re
# The ur'' prefix only exists in Python 2; a plain raw string works in Python 2 and 3.
p = re.compile(r'([a-z])\1{1,}')
re.findall(p, "ugknbfddgicrmopn")
#output: ['d']
re.findall(p, "abbbcppq")
#output: ['b', 'p']

For more details, you can refer to a similar question about Perl: Regular expression to match any character being repeated more than 10 times
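
Because findall() returns only the captured group when the pattern contains one, the calls above yield a single letter per run. If the whole run is wanted instead, one option (an addition here, not part of the original answer) is to read group(0) from finditer(). The names RUN and find_runs are hypothetical:

```python
import re

# One letter followed by one or more repeats of itself (same idea as \1{1,}).
RUN = re.compile(r'([a-z])\1+')


def find_runs(s):
    """Return each maximal run of a repeated letter as a whole string."""
    return [m.group(0) for m in RUN.finditer(s)]


print(find_runs("ugknbfddgicrmopn"))  # ['dd']
print(find_runs("abbbcppq"))          # ['bbb', 'pp']
```

Note that this returns 'bbb' for three b's, which differs from the pairs-only output the question asks for.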

Upvotes: 9
