Reputation: 455
I've been struggling to split a string with a regular expression in Python.
I load a text file whose contents are in this format:
"Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch
at Kate's house. Kyle went home at 9. \nSome other sentence
here\n\u2022Here's a bulleted line"
I'd like to get the following output:
['Peter went to the gym; he worked out for two hours','Kyle ate lunch
at Kate's house. Kyle went home at 9.', 'Some other sentence here',
'\u2022Here's a bulleted line']
I'm looking to split my string by a new line and a capital letter or a bullet point in Python.
I've tried tackling the first half of the problem, splitting my string by just a new line and capital letter.
Here's what I have so far:
print re.findall(r'\n[A-Z][a-z]+',str,re.M)
This just gives me:
[u'\nKyle', u'\nSome']
which is just the first word of each matching line. I've tried variations of that regular expression, but I don't know how to get the rest of the line.
I assume that to also split at the bullet point, I would add an alternation (OR) to the pattern, in the same form as the part that matches a capital letter. Is this the best way?
I hope this makes sense, and I'm sorry if my question is in any way unclear. :)
Upvotes: 2
Views: 1990
Reputation: 785196
You can use re.split with a lookahead, so the newline is consumed but the bullet or capital letter that follows it is kept:
>>> import re
>>> str = u"Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch at Kate's house. Kyle went home at 9. \nSome other sentence here\n\u2022Here's a bulleted line"
>>> print re.split(u'\n(?=\u2022|[A-Z])', str)
[u'Peter went to the gym; \nhe worked out for two hours ',
u"Kyle ate lunch at Kate's house. Kyle went home at 9. ",
u'Some other sentence here',
u"\u2022Here's a bulleted line"]
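Note that the pieces above still contain the newline in the middle of the first sentence and some trailing spaces, which the desired output in the question does not have. A minimal Python 3 sketch of one way to clean that up, assuming it is acceptable to collapse every run of whitespace into a single space (the names text, chunks and cleaned are only illustrative):

```python
import re

text = ("Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch "
        "at Kate's house. Kyle went home at 9. \nSome other sentence "
        "here\n\u2022Here's a bulleted line")

# Split on a newline only when the next character is a bullet or a capital
# letter; the lookahead keeps that character in the following piece.
chunks = re.split('\n(?=[\u2022A-Z])', text)

# Collapse remaining whitespace (including mid-sentence newlines and
# trailing spaces) into single spaces.
cleaned = [' '.join(chunk.split()) for chunk in chunks]
print(cleaned)
```

This is a post-processing step on top of the same split, not a different pattern; if the original spacing matters, chunk.replace('\n', '').strip() may be a better fit.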
Upvotes: 1
Reputation: 71451
You can split at a \n followed by a capital letter or the bullet character:
import re
s = """
Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch
at Kate's house. Kyle went home at 9. \nSome other sentence
here\n\u2022Here's a bulleted line
"""
new_list = list(filter(None, re.split('\n(?=•)|\n(?=[A-Z])', s)))
Output:
['Peter went to the gym; \nhe worked out for two hours ', "Kyle ate lunch \nat Kate's house. Kyle went home at 9. ", 'Some other sentence \nhere', "•Here's a bulleted line\n"]
Or, using the escape sequence instead of the literal bullet character:
new_list = list(filter(None, re.split('\n(?=\u2022)|\n(?=[A-Z])', s)))
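The two alternatives can also be folded into a single lookahead with a character class. A rough Python 3 sketch, assuming the same triple-quoted sample string; the helper name split_at_capital_or_bullet is hypothetical and not part of the re module:

```python
import re

def split_at_capital_or_bullet(text):
    """Hypothetical helper: split before lines starting with A-Z or a bullet."""
    # Stripping first removes the leading and trailing newlines that the
    # triple-quoted literal adds around the sample text.
    parts = re.split('\n(?=[A-Z\u2022])', text.strip())
    # Drop any empty pieces explicitly.
    return [part for part in parts if part]

s = """
Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch
at Kate's house. Kyle went home at 9. \nSome other sentence
here\n\u2022Here's a bulleted line
"""

result = split_at_capital_or_bullet(s)
print(len(result))
```

Newlines that are followed by a lowercase letter stay inside their piece, because the lookahead only accepts a capital letter or the bullet.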
Upvotes: 1