Reputation: 3
I have a long list of strings which are all random words, all of them capitalized, such as 'Pomegranate' and 'Yellow Banana'. However, some of them are stuck together, like so: 'AppleOrange'. There are no special characters or digits.
What I need is a regular expression in Python that matches 'Apple' and 'Orange' separately, but not 'Pomegranate' or 'Yellow'.
As expected, I'm very new to this, and I've only managed to write r"(?<!\s)([A-Z][a-z]*)"... But that still matches 'Yellow' and 'Pomegranate'. How do I do this?
Upvotes: 0
Views: 113
Reputation: 163362
If they all start with an uppercase char followed by optional lowercase chars, you can use lookarounds and an alternation to match both variations:
(?<=[a-z])[A-Z][a-z]*|[A-Z][a-z]*(?=[A-Z])
The pattern matches:
(?<=[a-z]) Assert a-z to the left
[A-Z][a-z]* Match A-Z and optional chars a-z
| Or
[A-Z][a-z]* Match A-Z and optional chars a-z
(?=[A-Z]) Assert A-Z to the right
Example
import re
pattern = r"(?<=[a-z])[A-Z][a-z]*|[A-Z][a-z]*(?=[A-Z])"
s = ("AppleOrange\nPomegranate Yellow Banana")
print(re.findall(pattern, s))
Output
['Apple', 'Orange']
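To see what each side of the alternation contributes, here is a small sketch that runs the two branches separately and then applies the combined pattern to a list of strings. The names after_lower, before_upper, combined and glued_words are made up for illustration, and the three-word input 'AppleOrangePear' is an assumption, not something from the question.

```python
import re

# The two branches of the alternation, tried separately.
after_lower = re.compile(r"(?<=[a-z])[A-Z][a-z]*")   # word glued to the one before it
before_upper = re.compile(r"[A-Z][a-z]*(?=[A-Z])")   # word glued to the one after it
combined = re.compile(r"(?<=[a-z])[A-Z][a-z]*|[A-Z][a-z]*(?=[A-Z])")

s = "AppleOrange\nPomegranate Yellow Banana"
only_after = after_lower.findall(s)
only_before = before_upper.findall(s)


def glued_words(strings):
    """Map each input string to the words in it that touch a neighbour.

    Hypothetical helper: standalone words give an empty list.
    """
    return {text: combined.findall(text) for text in strings}


per_string = glued_words(
    ["Pomegranate", "Yellow Banana", "AppleOrange", "AppleOrangePear"]
)

print(only_after)
print(only_before)
print(per_string)
```

The first branch alone finds only 'Orange' and the second only 'Apple', which is why both are needed. A word in the middle of a longer run, like 'Orange' in 'AppleOrangePear', is picked up by the lookbehind branch.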
Another option is to match what you don't want so it gets out of the way, use a capture group for what you want to keep, and then remove the empty entries from the result:
(?<!\S)[A-Z][a-z]*(?!\S)|([A-Z][a-z]*)
import re
pattern = r"(?<!\S)[A-Z][a-z]*(?!\S)|([A-Z][a-z]*)"
s = ("AppleOrange\nPomegranate Yellow Banana")
print([x for x in re.findall(pattern, s) if x])
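For reference, a short sketch of why the filtering step is needed: with a single capture group, re.findall returns the group's value for every match, so the standalone words that were matched by the first alternative show up as empty strings. The variable names raw and kept are only for illustration.

```python
import re

pattern = re.compile(r"(?<!\S)[A-Z][a-z]*(?!\S)|([A-Z][a-z]*)")
s = "AppleOrange\nPomegranate Yellow Banana"

# One entry per match; group 1 is empty when the first alternative matched.
raw = pattern.findall(s)

# Keep only the matches where the capture group took part.
kept = [x for x in raw if x]

print(raw)   # ['Apple', 'Orange', '', '', '']
print(kept)  # ['Apple', 'Orange']
```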
Upvotes: 1
Reputation: 101
This works:
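import re
from collections import deque
pattern = r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z](?=[a-z]|$))'
# Splitting on a capturing group keeps the matched capitals in the result
chunks = deque(re.split(pattern, 'AppleOrange'))
result = []
while len(chunks):
    buf = chunks.popleft()
    # Skip the empty strings that re.split leaves between matches
    if len(buf) == 0:
        continue
    # A lone capital is the start of a word: glue the next chunk onto it
    if re.match(r'^[A-Z]$', buf) and len(chunks):
        buf += chunks.popleft()
    result.append(buf)
print(result)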
Output:
['Apple', 'Orange']
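A sketch of the same loop wrapped in a function, to try it on the other strings from the question. The name split_on_capitals is made up here. As far as I can tell, this approach splits at every capital letter, so standalone words are returned as well rather than skipped, and the space in 'Yellow Banana' stays attached to the first chunk.

```python
import re
from collections import deque

SPLIT = re.compile(r'([A-Z]{2,}(?=[A-Z]|$)|[A-Z](?=[a-z]|$))')


def split_on_capitals(text):
    """Hypothetical wrapper around the split-and-rejoin loop above."""
    chunks = deque(SPLIT.split(text))
    result = []
    while chunks:
        buf = chunks.popleft()
        if not buf:
            continue  # empty string left over by re.split
        if re.match(r'^[A-Z]$', buf) and chunks:
            buf += chunks.popleft()  # rejoin the capital with the rest of its word
        result.append(buf)
    return result


print(split_on_capitals('AppleOrange'))
print(split_on_capitals('Pomegranate'))
print(split_on_capitals('Yellow Banana'))
```

So for the original question, the output would still need a check that a string produced more than one word with no whitespace in between.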
Upvotes: 1