cookie1986

Reputation: 895

Regex to split text file in python

I am trying to parse a transcript string into a list of speaker segments. A speaker label is the speaker's name in upper case followed by a colon. The problem is that some names contain characters other than upper-case letters, for example:

OBAMA: said something

O'MALLEY: said something else

GOV. HICKENLOOPER: said something else entirely

I have written the following regex, but I am struggling to get it to work:

mystring = "OBAMA: said something \nO'MALLEY: said something else \nGOV. HICKENLOOPER: said something else entirely"

parse_turns = re.split(r'\n(?=[A-Z]+(\ |\.|\'|\d)*[A-Z]*:)', mystring)

What I think I have written (and ideally what I want to do) is a command to split the string based on:

1. Find a newline

2. Use positive look-ahead for one or more uppercase characters

3. If upper-case characters are found look for optional characters from the list of periods, apostrophes, single spaces, and digits

4. If these optional characters are found, look for additional uppercase characters.

5. Crucially, find a colon symbol at the end of this sequence.

EDIT: In many cases, the content of the speech will have newline characters contained within it, and possibly colon symbols. As such, the only thing separating the speaker label from the content of speech is the sequence mentioned above.

Upvotes: 0

Views: 439

Answers (2)

flgang

Reputation: 114

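The capturing group is the problem: when the pattern contains a capturing group, re.split also puts the captured text into the result list. Just change (\ |\.|\'|\d) to the character class [\ \.\'\d] or to the non-capturing group (?:\ |\.|\'|\d):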

import re

mystring = "OBAMA: said something \nO'MALLEY: said something else \nGOV. HICKENLOOPER: said something else entirely"

parse_turns = re.split(r'\n(?=[A-Z]+[\ \.\'\d]*[A-Z]*:)', mystring)
print(parse_turns)
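
To see why the change matters, here is a small side-by-side sketch. It reuses the sample string from the question; the variable names `capturing` and `non_capturing` are just illustrative. Groups captured inside the look-ahead still count as captures, so re.split inserts them between the pieces.

```python
import re

mystring = ("OBAMA: said something \n"
            "O'MALLEY: said something else \n"
            "GOV. HICKENLOOPER: said something else entirely")

# Original pattern: the (...) group is capturing, so re.split adds
# the last text it captured to the result after every split point.
capturing = re.split(r"\n(?=[A-Z]+( |\.|'|\d)*[A-Z]*:)", mystring)

# Same pattern with a non-capturing group (?:...): only the segments remain.
non_capturing = re.split(r"\n(?=[A-Z]+(?: |\.|'|\d)*[A-Z]*:)", mystring)

print(len(capturing))      # 5: three segments plus two captured characters
print(len(non_capturing))  # 3: one segment per speaker turn
```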

Upvotes: 3

Thomas Kimber

Reputation: 11107

If it's true that the speaker's name and what they said are separated by a colon, then it might be simpler to move away from regex to do your splitting.

list_of_things = []
mystring = "OBAMA: Hi\nO'MALLEY: True Dat\nHUCK FINN: Sure thing\n"
lines = mystring.split("\n")  # First split the string into lines on the \n character
for line in lines:
    if ":" not in line:
        continue  # Skip lines without a speaker label, such as the empty string after the final \n
    # partition splits on the first colon only, so later colons stay in the utterance
    speaker, _, utterance = line.partition(":")
    list_of_things.append((speaker.strip(), utterance.strip()))

At the end, you should have a neat list of tuples containing speakers, and the things they said.
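
The line-by-line loop above assumes every turn fits on one line. If the speech itself can contain newlines and colons, as the question's EDIT says, one possible sketch is to split on the speaker-label look-ahead first and only then separate name from speech. The helper name `parse_turns` and the `sample` text are made up for illustration, and the label pattern is the one from the regex answer, so it only covers names shaped like the examples in the question. A continuation line that happens to start with an upper-case word and a colon would still be treated as a new speaker.

```python
import re

# Speaker-label pattern borrowed from the regex answer (an assumption
# about what names look like; adjust it for your own transcripts).
LABEL = r"[A-Z]+[ .'\d]*[A-Z]*:"

def parse_turns(transcript):
    """Return a list of (speaker, utterance) tuples."""
    turns = []
    # Split only at newlines that are directly followed by a speaker label.
    for segment in re.split(r"\n(?=" + LABEL + ")", transcript.strip()):
        # The first colon ends the label; later colons belong to the speech.
        speaker, sep, utterance = segment.partition(":")
        if sep:
            turns.append((speaker.strip(), utterance.strip()))
    return turns

sample = "OBAMA: Hi\nthere: everyone\nO'MALLEY: True Dat\nHUCK FINN: Sure thing\n"
print(parse_turns(sample))
# [('OBAMA', 'Hi\nthere: everyone'), ("O'MALLEY", 'True Dat'), ('HUCK FINN', 'Sure thing')]
```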

Upvotes: 1
