Bruce Walthers

Reputation: 13

In Python, I want to use a regex to find an embedded list of items inside a string

I want to be able to split the following string:

"This is a string with an embedded list.  1. My first list item.  2. My second item.  a. My first sub-item.  b. My second sub-item.  3. My last list item."

I would like to split it as:

"This is a string with an embedded list."
"1. My first list item."
"2. My second item."
"a. My first sub-item."
"b. My second sub-item."
"3. My last list item."

I cannot guarantee that each embedded list item will always be preceded by two spaces, but it will be preceded by at least one, or it will start the string. I also cannot guarantee that the first word of a list item will be capitalized. Lastly, the numbering could go into the teens, so an entry may start with, say, "10. ". If there is no embedded list, I would like to get the original string back, with no splitting.

In terms of rules to identify an embedded list item, here are some of my thoughts:

  1. It will always have one or more whitespace characters in front of it, or it will start the string.
  2. After the whitespace or start of string, it will have one or two digits followed by a period, or a single letter followed by a period. The letter may or may not be capitalized.

While this is not an exhaustive set of conditions, I think it will find a good amount of embedded lists.

Upvotes: 0

Views: 36

Answers (1)

Nick

Reputation: 147166

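You could split using this regex, which looks for one or more whitespace characters followed by either digits and a period or a single letter and a period. The digits/letter part sits inside a lookahead, so it is not consumed by the split and stays at the start of each piece: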

\s+(?=(?:\d+|[a-z])\.)
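
If the pattern looks dense, it can be spelled out piece by piece with re.VERBOSE. This is only an annotated restatement of the regex above; the name LIST_MARKER_GAP and the short sample string are made up for illustration.

```python
import re

# Same pattern as above, written with re.VERBOSE so that whitespace inside
# the pattern is ignored and each piece can carry a comment.
LIST_MARKER_GAP = re.compile(
    r"""
    \s+            # one or more whitespace characters (removed by the split)
    (?=            # lookahead: check what follows without consuming it
        (?:\d+     #   one or more digits
          |[a-z])  #   or a single letter (either case, because of re.I)
        \.         #   then a literal period
    )
    """,
    re.VERBOSE | re.IGNORECASE,
)

# Illustrative input, not from the question.
parts = LIST_MARKER_GAP.split("Intro text.  1. first  b. second")
print(parts)
```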

In Python (note the use of the re.I flag to match both upper- and lower-case letters):

import re

s = "This is a string with an embedded list.  1. My first list item.  2. My second item.  a. My first sub-item.  b. My second sub-item.  3. My last list item."

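# flags is passed by keyword; passing maxsplit/flags positionally is deprecated since Python 3.13
print(re.split(r'\s+(?=(?:\d+|[a-z])\.)', s, flags=re.I))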

Output:

[
 'This is a string with an embedded list.',
 '1. My first list item.',
 '2. My second item.',
 'a. My first sub-item.',
 'b. My second sub-item.',
 '3. My last list item.'
]
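
One caveat: this pattern also splits before things like "3.14" or "e.g.", since they are whitespace followed by a digit or letter and a period. A possible tightening, sketched here as an assumption rather than something the question requires, is to limit the number to one or two digits and to require whitespace after the period. The helper name split_embedded_list is hypothetical. It also hands back the original string when nothing splits, as the question asks.

```python
import re

# Hypothetical stricter variant: 1-2 digits or one letter, a period, and then
# whitespace. The trailing \s is an added assumption, not part of the regex above.
_MARKER = re.compile(r"\s+(?=(?:\d{1,2}|[a-z])\.\s)", flags=re.IGNORECASE)

def split_embedded_list(text):
    """Return the pieces as a list, or the original string if no split occurs."""
    parts = _MARKER.split(text)
    return parts if len(parts) > 1 else text

s = "This is a string with an embedded list.  1. My first list item.  2. My second item.  a. My first sub-item.  b. My second sub-item.  3. My last list item."
print(split_embedded_list(s))
print(split_embedded_list("Pi is about 3.14 e.g. roughly"))
```

A single letter that ends a sentence (for example "plan B. Then ...") would still be treated as a list marker, so this remains a heuristic.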

Upvotes: 1
