Regex to split text file in Python

Question

I am trying to parse a transcript string into speaker segments (as a list). Speaker labels are denoted by the speaker's name in uppercase followed by a colon. The problem is that some names contain characters that are not uppercase letters. Examples might include the following:

OBAMA: said something

O'MALLEY: said something else

GOV. HICKENLOOPER: said something else entirely

I have written the following regex, but I am struggling to get it to work:

import re

mystring = "OBAMA: said something \nO'MALLEY: said something else \nGOV. HICKENLOOPER: said something else entirely"

parse_turns = re.split(r'\n(?=[A-Z]+(\ |\.|\'|\d)*[A-Z]*:)', mystring)

What I think I have written (and ideally what I want to do) is a command to split the string based on:

1. Find a newline

2. Use positive look-ahead for one or more uppercase characters

3. If uppercase characters are found, look for optional characters from the set of periods, apostrophes, single spaces, and digits

4. If these optional characters are found, look for additional uppercase characters.

5. Crucially, find a colon symbol at the end of this sequence.
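
Steps 2 to 5 above can be sketched as an annotated pattern using re.VERBOSE, which makes it easier to check each piece against the intent. This is only an illustration: the name LABEL and the sample lines are assumptions made for the sketch, the newline from step 1 is left out so the label part can be tested on its own, and the group is written as non-capturing.

```python
import re

# Hypothetical name LABEL: the speaker-label part of the pattern (steps 2-5).
LABEL = re.compile(
    r"""
    [A-Z]+            # step 2: one or more uppercase letters
    (?:\ |\.|'|\d)*   # step 3: optional spaces, periods, apostrophes, digits
    [A-Z]*            # step 4: optional further uppercase letters
    :                 # step 5: the colon that ends the label
    """,
    re.VERBOSE,
)

lines = [
    "OBAMA: said something",
    "O'MALLEY: said something else",
    "GOV. HICKENLOOPER: said something else entirely",
]

# match() anchors at the start of each line and returns the label text.
labels = [LABEL.match(line).group(0) for line in lines]
print(labels)
```

In verbose mode unescaped whitespace in the pattern is ignored, which is why the space in step 3 is written as an escaped space.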

EDIT: In many cases, the content of the speech will have newline characters contained within it, and possibly colon symbols. As such, the only thing separating the speaker label from the content of speech is the sequence mentioned above.

flgang · Accepted Answer

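Just change (\ |\.|\'|\d) to [\ \.\'\d] or (?:\ |\.|\'|\d). re.split includes the text of every capturing group in the list it returns, so the capturing group inside your look-ahead adds extra items between the segments. A character class or a non-capturing group matches the same text without capturing it.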
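
To see the difference, here is a small comparison of the two patterns on the sample string from the question. The variable names text, with_group, and without_group are made up for this sketch.

```python
import re

text = (
    "OBAMA: said something \n"
    "O'MALLEY: said something else \n"
    "GOV. HICKENLOOPER: said something else entirely"
)

# Capturing group: re.split also returns what the group last matched
# at each split point, so stray items appear between the segments.
with_group = re.split(r"\n(?=[A-Z]+(\ |\.|'|\d)*[A-Z]*:)", text)

# Character class: nothing is captured, only the segments come back.
without_group = re.split(r"\n(?=[A-Z]+[ .'\d]*[A-Z]*:)", text)

print(len(with_group))     # 5
print(len(without_group))  # 3
```

The two extra items in with_group are the apostrophe from O'MALLEY and the space from GOV. HICKENLOOPER, i.e. the last thing the group matched before each split.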

import re

mystring = "OBAMA: said something \nO'MALLEY: said something else \nGOV. HICKENLOOPER: said something else entirely"

parse_turns = re.split(r'\n(?=[A-Z]+[\ \.\'\d]*[A-Z]*:)', mystring)
print(parse_turns)
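
The question's EDIT mentions speech that itself contains newlines and colons. A quick check of that case, using a made-up transcript (the sample text and the names transcript, turns, and pairs are assumptions for illustration):

```python
import re

transcript = (
    "OBAMA: said something\n"
    "and continued on a second line: with a colon\n"
    "O'MALLEY: said something else\n"
    "GOV. HICKENLOOPER: said something else entirely"
)

# Split only at newlines that are followed by a speaker label.
turns = re.split(r"\n(?=[A-Z]+[ .'\d]*[A-Z]*:)", transcript)

# Separate each turn into (speaker, speech) at the first colon.
pairs = [tuple(part.strip() for part in turn.split(":", 1)) for turn in turns]

for speaker, speech in pairs:
    print(repr(speaker), repr(speech))
```

The continuation line stays attached to OBAMA's turn because it starts with a lowercase letter. A line of speech that happens to begin with an uppercase word followed by a colon (for example "NOTE: ...") would still be treated as a new speaker, so this is a sketch to validate against real transcripts rather than a guaranteed parser.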
