kennedyl

Reputation: 181

Splitting on groups of capital letters in Python

I'm trying to tokenize a number of strings using a capital letter as a delimiter. I have landed on the following code:

import re

token = [a for a in re.split(r'([A-Z][a-z]*)', "ABCowDog") if a]

print(token)

And I get this, as expected, in return:

['A', 'B', 'Cow', 'Dog']

This is just an example string to keep things simple. In my real case I want to go through this list, find the single-character tokens (easy enough by checking len()), and join consecutive single letters together, provided they meet a prior definition. In the example above, the strings I actually want are 'AB', 'Cow', and 'Dog' (consecutive capitals are part of an acronym). Once I have my token list, though, I can't figure out how to walk it. Sorry if this has a simple answer, but I'm fairly new to Python and am sick of banging my head against the wall.
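For reference, one way to walk the token list and join adjacent single letters might look like the sketch below. The helper name merge_single_letters is hypothetical, and the "prior definition" check is not specified here, so the sketch simply joins every run of one-character tokens.

```python
from itertools import groupby

def merge_single_letters(tokens):
    """Join each run of consecutive one-character tokens into a single string."""
    merged = []
    # groupby clusters adjacent tokens that share the same key:
    # True for single characters, False for longer tokens.
    for is_single, run in groupby(tokens, key=lambda t: len(t) == 1):
        if is_single:
            merged.append(''.join(run))  # 'A', 'B' -> 'AB'
        else:
            merged.extend(run)           # longer tokens pass through unchanged
    return merged

print(merge_single_letters(['A', 'B', 'Cow', 'Dog']))  # ['AB', 'Cow', 'Dog']
```

Any extra condition on which letters may be combined would go inside the is_single branch.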

Upvotes: 3

Views: 2197

Answers (3)

Casimir et Hippolyte

Reputation: 89557

re.split isn't always easy to use and can be limiting. You can try a different approach with re.findall:

>>> import re
>>> s = 'ABCowDog'
>>> re.findall(r'[A-Z](?:[A-Z]*(?![a-z])|[a-z]*)', s)
['AB', 'Cow', 'Dog']
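To make the pattern easier to read, here is the same expression written with re.VERBOSE so that each part can carry a comment. This is only a sketch for illustration; the names TOKEN and tokenize are made up here and are not part of the answer above.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
TOKEN = re.compile(r"""
    [A-Z]                 # every token starts with a capital letter
    (?:
        [A-Z]* (?![a-z])  # acronym: more capitals, but not one that is followed by a lowercase letter
      |
        [a-z]*            # ordinary word: zero or more lowercase letters
    )
""", re.VERBOSE)

def tokenize(s):
    # With no capturing groups, findall returns the full text of each match.
    return TOKEN.findall(s)

print(tokenize('ABCowDog'))  # ['AB', 'Cow', 'Dog']
```

The negative lookahead is what makes the acronym branch give back the final capital ('C' in 'ABCow') so that it can start the next word.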

Upvotes: 3

karthik manchala

Reputation: 13640

You can split on the following zero-width lookahead using the third-party regex module:

(?=[A-Z][a-z])

Code:

import regex  # third-party package: pip install regex

regex.split(r'(?=[A-Z][a-z])', "ABCowDog", flags=regex.VERSION1)
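On Python 3.7 and later, the standard re module can also split on a pattern that matches the empty string, so the same lookahead should work without the regex package. A minimal sketch, assuming Python 3.7+; the helper name split_before_words is hypothetical:

```python
import re

def split_before_words(s):
    # Zero-width lookahead: split at every position that is followed by
    # a capital letter and then a lowercase letter.
    parts = re.split(r'(?=[A-Z][a-z])', s)
    # A match at position 0 leaves an empty first element, so drop empties.
    return [part for part in parts if part]

print(split_before_words('ABCowDog'))  # ['AB', 'Cow', 'Dog']
```

Note that this splits only before capitalized words, so a trailing lone capital (as in 'CowA') stays attached to the preceding word.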

Upvotes: 1

vks

Reputation: 67968

([A-Z][a-z]+)

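You should split on this pattern. Because of the capturing group, re.split keeps the matched words in the result; you still need to filter out the empty strings it leaves between adjacent matches.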
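As a rough sketch of how that pattern could be used with re.split (the helper name split_on_words is made up for illustration):

```python
import re

def split_on_words(s):
    # The capturing group makes re.split return the matched words
    # along with the text between them.
    parts = re.split(r'([A-Z][a-z]+)', s)
    # For 'ABCowDog', parts is ['AB', 'Cow', '', 'Dog', '']; drop the empties.
    return [part for part in parts if part]

print(split_on_words('ABCowDog'))  # ['AB', 'Cow', 'Dog']
```

Using + instead of * means a lone capital never matches as a word, so runs of capitals stay together as the leftover text between matches.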

Upvotes: 0
