Rohan

Reputation: 455

Split by regex of new line and capital letter

I've been struggling to split a string with a regular expression in Python.

I load a text file whose contents look like this:

"Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch 
 at Kate's house. Kyle went home at 9. \nSome other sentence 
 here\n\u2022Here's a bulleted line"

I'd like to get the following output:

['Peter went to the gym; he worked out for two hours','Kyle ate lunch 
at Kate's house. Kyle went home at 9.', 'Some other sentence here', 
'\u2022Here's a bulleted line']

I want to split the string at every newline that is followed by a capital letter or a bullet point.

I've tried tackling the first half of the problem: splitting at a newline followed by a capital letter.

Here's what I have so far:

print re.findall(r'\n[A-Z][a-z]+',str,re.M)

This just gives me:

[u'\nKyle', u'\nSome']

which is just the first word after each newline. I've tried variations of that regex, but I don't know how to get the rest of the line.

I assume that to also split at the bullet point, I would add an alternation (OR) in the same form as the capital-letter pattern. Is this the best way?

I hope this makes sense, and I'm sorry if my question is in any way unclear. :)

Upvotes: 2

Views: 1990

Answers (2)

anubhava

Reputation: 785196

You can use re.split with a lookahead, so that only the newline is consumed:

>>> import re
>>> str = u"Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch at Kate's house. Kyle went home at 9. \nSome other sentence here\n\u2022Here's a bulleted line"
>>> print re.split(u'\n(?=\u2022|[A-Z])', str)

[u'Peter went to the gym; \nhe worked out for two hours ',
 u"Kyle ate lunch at Kate's house. Kyle went home at 9. ",
 u'Some other sentence here',
 u"\u2022Here's a bulleted line"]
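
The result above still contains the inner \n and the trailing spaces, which the desired output in the question does not. Here is a minimal Python 3 sketch of one way to finish the job, assuming it is acceptable to collapse every run of whitespace inside an entry to a single space. The names split_entries and text are hypothetical, used only for illustration:

```python
import re

# Same sample text as in the question, written on one logical line.
text = (
    "Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch "
    "at Kate's house. Kyle went home at 9. \nSome other sentence "
    "here\n\u2022Here's a bulleted line"
)

def split_entries(s):
    # Split at a newline only when the next character is a bullet or a capital letter.
    parts = re.split(r'\n(?=[\u2022A-Z])', s)
    # Collapse leftover newlines and runs of whitespace to single spaces,
    # trim both ends, and skip pieces that are empty after trimming.
    return [' '.join(p.split()) for p in parts if p.strip()]

entries = split_entries(text)
print(entries)
```

The character class [\u2022A-Z] is equivalent to the alternation \u2022|[A-Z] here, since both alternatives match exactly one character.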


Upvotes: 1

Ajax1234

Reputation: 71451

You can split at a \n followed by a capital letter or the bullet character:

import re
s = """
Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch 
at Kate's house. Kyle went home at 9. \nSome other sentence 
here\n\u2022Here's a bulleted line
"""
# list() is needed on Python 3, where filter() returns a lazy iterator
new_list = list(filter(None, re.split('\n(?=•)|\n(?=[A-Z])', s)))

Output:

['Peter went to the gym; \nhe worked out for two hours ', "Kyle ate lunch \nat Kate's house. Kyle went home at 9. ", 'Some other sentence \nhere', "•Here's a bulleted line\n"]

Or, without using the symbol for the bullet character:

new_list = list(filter(None, re.split('\n(?=\u2022)|\n(?=[A-Z])', s)))
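
To see why the lookahead (?=...) matters, here is a small Python 3 comparison on a made-up sample string (not the asker's data). It is only a sketch: without the lookahead, the capital letter or bullet becomes part of the delimiter and is dropped from the result.

```python
import re

# Hypothetical sample: one wrapped entry, one capitalised entry, one bulleted entry.
s = "first line \nstill first \nSecond entry\n\u2022Third entry"

# Consuming pattern: the capital letter or bullet is part of the delimiter and is lost.
consumed = re.split(r'\n[A-Z\u2022]', s)

# Lookahead: only the newline is consumed; the next character stays with its entry.
kept = re.split(r'\n(?=[A-Z\u2022])', s)

print(consumed)
print(kept)
```

In both cases the newline before the lowercase "still" is left alone, because neither pattern matches a newline followed by a lowercase letter.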

Upvotes: 1
