Split by regex of new line and capital letter

Question

I've been struggling to split my string by a regex expression in Python.

I have a text file which I load that is in the format of:

"Peter went to the gym; 
he worked out for two hours 
Kyle ate lunch 
 at Kate's house. Kyle went home at 9. 
Some other sentence 
 here
\u2022Here's a bulleted line"

I'd like to get the following output:

['Peter went to the gym; he worked out for two hours',
 "Kyle ate lunch at Kate's house. Kyle went home at 9.",
 'Some other sentence here',
 "\u2022Here's a bulleted line"]

I'm looking to split my string by a new line and a capital letter or a bullet point in Python.

I've tried tackling the first half of the problem, splitting my string by just a new line and capital letter.

Here's what I have so far:

print re.findall(r'\n[A-Z][a-z]+',str,re.M)

This just gives me:

[u'\nKyle', u'\nSome']

which is just the first word. I've tried variations of that regex expression but I don't know how to get the rest of the line.

I assume that to also split by the bullet point, I would just include an OR regex expression that is in the same format as the regex of splitting by a capital letter. Is this the best way?

I hope this makes sense and I'm sorry if my question is in any way unclear. :)

anubhava · Accepted Answer

You can use this split function:

>>> import re
>>> str = u"Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch at Kate's house. Kyle went home at 9. \nSome other sentence here\n\u2022Here's a bulleted line"
>>> print re.split(u'\n(?=\u2022|[A-Z])', str)

[u'Peter went to the gym; \nhe worked out for two hours ',
 u"Kyle ate lunch at Kate's house. Kyle went home at 9. ",
 u'Some other sentence here',
 u"\u2022Here's a bulleted line"]
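The lookahead `(?=...)` makes the split happen at a newline only when the next character is a capital letter or a bullet, and because a lookahead does not consume anything, that character stays at the start of the next piece. The question also asks for wrapped lines to be joined, so here is a minimal Python 3 sketch of that extra step. It assumes the sample text from the question, and the names `BOUNDARY` and `split_entries` are made up for illustration:

```python
import re

# Sample text from the question, with the line breaks written as \n.
text = (
    "Peter went to the gym; \n"
    "he worked out for two hours \n"
    "Kyle ate lunch \n"
    " at Kate's house. Kyle went home at 9. \n"
    "Some other sentence \n"
    " here\n"
    "\u2022Here's a bulleted line"
)

# Match a newline only if it is followed by A-Z or a bullet (U+2022).
# The lookahead checks the next character without consuming it.
BOUNDARY = re.compile("\n(?=[A-Z\u2022])")


def split_entries(raw):
    """Split at boundary newlines, then collapse leftover whitespace runs.

    Hypothetical helper: any newline that was not a boundary (a wrapped
    line) is replaced by a single space, and trailing spaces are dropped.
    """
    pieces = BOUNDARY.split(raw)
    return [" ".join(piece.split()) for piece in pieces]


entries = split_entries(text)
for entry in entries:
    print(entry)
```

Note that `[A-Z]` only covers ASCII capitals, so a line starting with an accented capital or a digit would be treated as a continuation of the previous entry.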

