Reputation: 13
I want to be able to split the following string:
"This is a string with an embedded list. 1. My first list item. 2. My second item. a. My first sub-item. b. My second sub-item. 3. My last list item."
I would like to split it as:
"This is a string with an embedded list."
"1. My first list item."
"2. My second item."
"a. My first sub-item."
"b. My second sub-item."
"3. My last list item."
I cannot guarantee that each embedded list item will always be preceded by two spaces, but it will be preceded by at least one space, or it will start the string. Also, I cannot guarantee that the first word of an embedded list item will always be capitalized. Lastly, the numbered and lettered markers inside the string could go into the teens, so an entry may start with, say, "10. ". If there is no embedded list, I would like to get the original string back, with no splitting.
In terms of rules to identify an embedded list item, here are some of my thoughts:
While this is not an exhaustive set of conditions, I think it will find a good amount of embedded lists.
Upvotes: 0
Views: 36
Reputation: 147166
You could split using this regex, which matches one or more whitespace characters that are followed (checked with a lookahead, so the marker itself is kept) by either digits and a period or a single letter and a period:
\s+(?=(?:\d+|[a-z])\.)
In Python (note the use of the re.I flag to match both upper- and lower-case letters):
import re
s = "This is a string with an embedded list. 1. My first list item. 2. My second item. a. My first sub-item. b. My second sub-item. 3. My last list item."
# Pass flags by keyword: positional maxsplit/flags are deprecated since Python 3.13.
print(re.split(r'\s+(?=(?:\d+|[a-z])\.)', s, flags=re.I))
Output:
[
'This is a string with an embedded list.',
'1. My first list item.',
'2. My second item.',
'a. My first sub-item.',
'b. My second sub-item.',
'3. My last list item.'
]
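The question also asks to get the original string back when there is no embedded list, and plain re.split always returns a list. A minimal sketch of a wrapper follows. The names LIST_MARKER and split_embedded_list are hypothetical. The tighter pattern is my assumption rather than part of the regex above: it limits numbers to one or two digits and requires whitespace after the period, so abbreviations such as "e.g." are not treated as markers.

```python
import re

# Assumed tightening of the pattern above: a marker is 1-2 digits or a single
# letter, then a period, then whitespace. The lookahead keeps the marker in
# the piece that follows the split point.
LIST_MARKER = re.compile(r'\s+(?=(?:\d{1,2}|[a-z])\.\s)', flags=re.IGNORECASE)


def split_embedded_list(text):
    """Hypothetical helper: return the list of pieces, or the original
    string unchanged when no embedded list marker is found."""
    # Strip first so a marker at the very start does not yield an empty piece.
    parts = LIST_MARKER.split(text.strip())
    return parts if len(parts) > 1 else text


s = ("This is a string with an embedded list. 1. My first list item. "
     "2. My second item. a. My first sub-item. b. My second sub-item. "
     "3. My last list item.")
print(split_embedded_list(s))
print(split_embedded_list("No list here."))
```

This is still a heuristic: a sentence ending in a single letter, such as "Plan A. Next step", would be split before "A.", so further rules may be needed depending on the data.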
Upvotes: 1