Reputation: 26463
I have a couple of sentences in UTF that I want to split at the first capital letter that follows a space.
Examples:
"Tough Fox" -> "Tough", "Fox"
"Nice White Cat" -> "Nice", "White Cat"
"This is a lazy Dog" -> "This is a lazy", "Dog"
"This is hardworking Little Ant" -> "This is hardworking", "Little Ant"
What is the Pythonic way to do such splitting?
Upvotes: 1
Views: 160
Reputation: 16556
I would use re:
>>> import re
>>> l = ["Tough Fox", "Nice White Cat", "This is a lazy Dog" ]
>>> for i in l:
...     print(re.findall("[A-Z][^A-Z]*", i))
...
['Tough ', 'Fox']
['Nice ', 'White ', 'Cat']
['This is a lazy ', 'Dog']
Edit:
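Okay, I thought that was a mistake. So now I am a little late, and re.split(..., s, maxsplit=1)
is IMHO the best way, but you could still do it without maxsplit. Note that this variant keeps only the first word and everything from the next capital letter onward, so words in between are dropped (see the last result):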
>>> for i in l:
...     print(re.findall("^[^ ]*|[A-Z].*", i))
...
['Tough', 'Fox']
['Nice', 'White Cat']
['This', 'Dog']
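Since the question mentions UTF text, it may be worth noting that [A-Z] only matches ASCII capitals. Below is a minimal regex-free sketch, assuming Python 3 strings; the helper name split_at_first_capital is hypothetical and not from the original posts. It relies on str.isupper() and str.isspace(), and it keeps the words before the capital intact:

```python
def split_at_first_capital(text):
    """Split once at the first uppercase letter that directly follows whitespace."""
    for i in range(1, len(text)):
        if text[i].isupper() and text[i - 1].isspace():
            # Drop the whitespace run that precedes the capital letter.
            return [text[:i].rstrip(), text[i:]]
    # No capital letter after whitespace was found: return the text unsplit.
    return [text]


print(split_at_first_capital("This is a lazy Dog"))
print(split_at_first_capital("très Élégant chat"))
```

Unlike the [A-Z] patterns, this should also split before non-ASCII capitals such as É.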
Upvotes: 3
Reputation: 32497
If you want to split a string at the first capital letter that follows whitespace:
import re
s = "Tough Fox"
re.split(r"\s(?=[A-Z])", s, maxsplit=1)
['Tough', 'Fox']
The re.split method works like the built-in str.split, but allows a regular expression to be used as the split pattern.
The regex first looks for a whitespace character (\s) as the split pattern. This part of the match is consumed by the re.split operation, so it does not appear in the result.
The (?=...) part is a lookahead assertion. The next character(s) in the string must match it (in this case any capital letter, [A-Z]). However, this part is not considered part of the match, so it is not consumed by the re.split operation.
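To illustrate the difference the lookahead makes, here is a small sketch (the variable names are only for illustration) comparing a pattern that matches the capital letter directly with one that only looks ahead at it:

```python
import re

s = "Nice White Cat"

# Without a lookahead, the capital letter is part of the match and is consumed.
consumed = re.split(r"\s[A-Z]", s, maxsplit=1)

# With a lookahead, only the whitespace is consumed; the capital letter stays.
kept = re.split(r"\s(?=[A-Z])", s, maxsplit=1)

print(consumed)  # ['Nice', 'hite Cat']
print(kept)      # ['Nice', 'White Cat']
```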
The maxsplit=1 argument makes sure that at most one split occurs (so at most two items are returned).
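As a quick sketch of what maxsplit changes (variable names are illustrative only), compare the same pattern with and without the limit:

```python
import re

pattern = r"\s(?=[A-Z])"
s = "Nice White Cat"

# No limit: split before every capital letter that follows whitespace.
unlimited = re.split(pattern, s)

# maxsplit=1: stop after the first split, leaving the rest untouched.
limited = re.split(pattern, s, maxsplit=1)

# If nothing matches, the whole string comes back as a single item.
unsplit = re.split(pattern, "no capitals here", maxsplit=1)

print(unlimited)  # ['Nice', 'White', 'Cat']
print(limited)    # ['Nice', 'White Cat']
print(unsplit)    # ['no capitals here']
```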
Upvotes: 3
Reputation: 1122152
Use re.split() with a limit:
import re
space_split = re.compile(r'\s+(?=[A-Z])')
result = space_split.split(inputstring, maxsplit=1)
Demo:
>>> import re
>>> space_split = re.compile(r'\s+(?=[A-Z])')
>>> l = ["Tough Fox", "Nice White Cat", "This is a lazy Dog" ]
>>> for i in l:
...     print(space_split.split(i, 1))
...
['Tough', 'Fox']
['Nice', 'White Cat']
['This is a lazy', 'Dog']
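The \s+ in this pattern matters when more than one whitespace character precedes the capital. A small sketch, using a made-up input with two spaces:

```python
import re

s = "Tough  Fox"  # two spaces between the words

# A single \s only consumes the space right before the capital,
# so the first part keeps a trailing space.
single = re.split(r"\s(?=[A-Z])", s, maxsplit=1)

# \s+ consumes the whole run of whitespace.
run = re.split(r"\s+(?=[A-Z])", s, maxsplit=1)

print(single)  # ['Tough ', 'Fox']
print(run)     # ['Tough', 'Fox']
```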
Upvotes: 1
Reputation: 65791
Maybe like this:
In [1]: import re
In [2]: def split(s):
...:     return re.split(r'\W(?=[A-Z])', s, maxsplit=1)
...:
In [3]: l = ["Tough Fox", "Nice White Cat", "This is a lazy Dog" ]
In [4]: for s in l:
...:     print(split(s))
...:
['Tough', 'Fox']
['Nice', 'White Cat']
['This is a lazy', 'Dog']
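Note that \W matches any non-word character, not just whitespace, so this version can also split at punctuation such as a hyphen. A small sketch with a made-up input shows the difference from \s:

```python
import re

s = "Well-Known Fact"

# \W also matches the hyphen, so the split happens inside the first word.
non_word = re.split(r"\W(?=[A-Z])", s, maxsplit=1)

# \s only matches whitespace, so the hyphenated word stays together.
whitespace = re.split(r"\s(?=[A-Z])", s, maxsplit=1)

print(non_word)    # ['Well', 'Known Fact']
print(whitespace)  # ['Well-Known', 'Fact']
```

Which behaviour is preferable depends on whether punctuation before a capital should count as a split point.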
Upvotes: 1