Reputation: 181
I'm trying to tokenize a number of strings using a capital letter as a delimiter. I have landed on the following code:
import re
token = [a for a in re.split(r'([A-Z][a-z]*)', "ABCowDog") if a]
print(token)
And I get this, as expected, in return:
['A', 'B', 'Cow', 'Dog']
Now, this is just an example string to keep things simple. In my real case I want to walk this list, find the single-character tokens (easy enough by checking len()), and join consecutive single letters together, provided they meet a prior definition. In the example above, 'AB', 'Cow', and 'Dog' are the strings I actually want to form (consecutive capitals are part of an acronym). For whatever reason, once I have my tokens, I can't figure out how to walk the list. Sorry if this has a simple answer, but I'm fairly new to Python and am sick of banging my head against the wall.
Upvotes: 3
Views: 2197
Reputation: 89557
re.split isn't always easy to use and can be limiting in many situations. You can try a different approach with re.findall:
>>> import re
>>> s = 'ABCowDog'
>>> re.findall(r'[A-Z](?:[A-Z]*(?![a-z])|[a-z]*)', s)
['AB', 'Cow', 'Dog']
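If you would rather keep the re.split tokens from the question and walk the list yourself, here is a minimal sketch of one way to do it. The helper name merge_single_capitals is hypothetical, and it assumes "acronym" simply means a run of consecutive one-letter uppercase tokens; any stricter "prior definition" would need an extra check inside the loop.

```python
import re

def merge_single_capitals(tokens):
    """Join runs of consecutive one-letter uppercase tokens into one string."""
    merged = []
    run = []  # single capitals collected since the last multi-letter token
    for tok in tokens:
        if len(tok) == 1 and tok.isupper():
            run.append(tok)
        else:
            if run:
                # A longer token ends the current run of single capitals.
                merged.append("".join(run))
                run = []
            merged.append(tok)
    if run:
        # Flush a run that reaches the end of the list.
        merged.append("".join(run))
    return merged

tokens = [t for t in re.split(r'([A-Z][a-z]*)', "ABCowDog") if t]
print(merge_single_capitals(tokens))  # ['AB', 'Cow', 'Dog']
```

The loop only buffers single capitals and flushes them when a longer token (or the end of the list) is reached, so the original token order is preserved.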
Upvotes: 3
Reputation: 13640
You can use the following pattern to split with the third-party regex module:
(?=[A-Z][a-z])
Code:
import regex
regex.split(r'(?=[A-Z][a-z])', "ABCowDog", flags=regex.VERSION1)
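As a side note, on Python 3.7 and later the standard re module can also split on a pattern that matches an empty string, so the same lookahead should work without the regex module. A small sketch under that assumption (the function name split_before_capitalized_word is made up for illustration):

```python
import re

def split_before_capitalized_word(s):
    # Split at every position followed by an uppercase letter and then a
    # lowercase letter. Requires Python 3.7+, where re.split accepts
    # patterns that can match an empty string.
    # A match at position 0 yields a leading '', so empty parts are dropped.
    return [part for part in re.split(r'(?=[A-Z][a-z])', s) if part]

print(split_before_capitalized_word("ABCowDog"))  # ['AB', 'Cow', 'Dog']
```

Note that this lookahead only splits before a capitalized word, so an acronym that follows a word stays attached to it: "CowABDog" gives ['CowAB', 'Dog'], whereas the re.findall pattern from the other answer gives ['Cow', 'AB', 'Dog'].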
Upvotes: 1