Jake

Reputation: 2912

Python regex split but retain certain characters for split

I have the following text

text = "Perennials. Stolons slender. Perianth bristles 6 or 7, ca. 2 × as long as nutlet"

I want to split the passage using a separator defined by the regex "\.\s[A-Z]" (a period, a whitespace character, then a capital letter). However, I still want to keep the [A-Z] letter at the start of the sentence that follows, so that the output is this:

['Perennials',
 'Stolons slender',
 'Perianth bristles 6 or 7, ca. 2 × as long as nutlet']

What I have tried so far is:

re.split(r'\.\s[A-Z]', text)

but it removes the first letter of every sentence after the first:

['Perennials',
 'tolons slender',
 'erianth bristles 6 or 7, ca. 2 × as long as nutlet']

Can anyone help? Thanks~

Upvotes: 0

Views: 35

Answers (1)

Tim Biegeleisen

Reputation: 521457

Split using a lookahead:

import re

result = re.split(r'\.\s(?=[A-Z])', text)
print(result)

['Perennials', 'Stolons slender', 'Perianth bristles 6 or 7, ca. 2 × as long as nutlet']

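The lookahead (?=[A-Z]) asserts that a capital letter follows the dot and the whitespace character, but it does not consume that letter. Only the dot and the whitespace are treated as the separator, so the capital letter stays at the start of the next piece. Note that "ca. 2" is not split, because the character after the whitespace is a digit rather than a capital letter.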
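
To make the difference concrete, here is a small side-by-side sketch that runs both patterns on the text from the question. The variable names consumed and kept are only illustrative; everything else comes from the standard re module.

```python
import re

text = "Perennials. Stolons slender. Perianth bristles 6 or 7, ca. 2 × as long as nutlet"

# Consuming pattern: the capital letter is part of the separator,
# so re.split() throws it away along with the dot and the space.
consumed = re.split(r'\.\s[A-Z]', text)

# Lookahead pattern: the capital letter is only checked, not matched,
# so it remains at the start of the following piece.
kept = re.split(r'\.\s(?=[A-Z])', text)

print(consumed)
print(kept)
```

If the sentences might be separated by more than one whitespace character, \s+ in place of \s should handle that, though this is an assumption about the input rather than something the question states.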

Upvotes: 2
