Reputation: 2912
I have the following text:
text = "Perennials. Stolons slender. Perianth bristles 6 or 7, ca. 2 × as long as nutlet"
I want to split the passage using a separator defined by the pattern "\.\s[A-Z]". However, I want to keep the [A-Z] letter at the start of the sentence that follows, so that the output is this:
['Perennials',
'Stolons slender',
'Perianth bristles 6 or 7, ca. 2 × as long as nutlet']
What I have tried so far is:
re.split(r'\.\s[A-Z]', text)
but it removes the first letter of each following sentence:
['Perennials',
'tolons slender',
'erianth bristles 6 or 7, ca. 2 × as long as nutlet']
Can anyone help? Thanks~
Upvotes: 0
Views: 35
Reputation: 521457
Split using a lookahead:
import re

result = re.split(r'\.\s(?=[A-Z])', text)
print(result)
['Perennials', 'Stolons slender', 'Perianth bristles 6 or 7, ca. 2 × as long as nutlet']
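The lookahead (?=[A-Z]) asserts that a capital letter follows the dot and space, but does not consume it. Only the dot and space are treated as the separator, so the capital letter stays at the start of the next piece.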
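To make the difference concrete, here is a small side-by-side sketch of the two patterns on the text from the question. The variable names consumed and kept are just illustrative. Note that "ca. 2" is left alone by both patterns, since the character after the dot and space is a digit rather than a capital letter.

```python
import re

text = "Perennials. Stolons slender. Perianth bristles 6 or 7, ca. 2 × as long as nutlet"

# Original attempt: [A-Z] is part of the match, so re.split discards it
# together with the dot and the space.
consumed = re.split(r'\.\s[A-Z]', text)

# Lookahead version: (?=[A-Z]) only checks that a capital letter follows;
# the match itself is just the dot and the space.
kept = re.split(r'\.\s(?=[A-Z])', text)

print(consumed)
print(kept)
```

If the sentences might be separated by more than one whitespace character, \s+ in place of \s would probably be a reasonable adjustment, but that is an assumption about the input rather than something the question states.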
Upvotes: 2