How do I capture everything after a pattern until I find another instance of the pattern in my text?

Question

More specifically, I'm working with a txt file of a dialogue between a few different speakers. I'm trying to separate out the text according to speaker. The file looks like this:

    S1: that's (your schedule.)
    S2: okay
    S1: thanks a lot Sue i think i feel much better now.
    S2: so it's really nice to talk to somebody who's not dropping a class  i have to say. you don't know how many dropped just like_ that's good [S1: alright ] okay [S1: thanks a lot Sue ] bye-bye next... you're Marianne?
    S3: yeah. hi
    S2: hi i'm Sue
    S3: hi
    S2: um did she hit you out there?

So I've created an empty DataFrame to store everything that each speaker says. I want everything that S1 says to be stored separately from everything S2 says, and so on.

My problem is I don't know how to capture the text between each speaker ID. So, for example, when it sees 'S1:', it should capture all the text until it sees the next speaker ID, such as 'S2:'. Here is what I'm currently working with. The result is 'None'.

df=pd.DataFrame(columns=['Speaker ID', 'Role', 'Gender', 'Utterance', '# of Words', '# of Pronouns', '1st P SG', '1st P PL', '2nd P'])
#print (df)
f=open('Academic_Advising.txt').read()
try:
    pattern=re.search('^S\d+', f).group()
except: AttributeError
pattern=None
print (pattern)

Sorry if this is long-winded. I'm just trying to be clear.

Jan · Accepted Answer

Imo, use a defaultdict in combination with a simple regular expression:

from collections import defaultdict
import re

text = """
S1: that's (your schedule.)
S2: okay
S1: thanks a lot Sue i think i feel much better now.
S2: so it's really nice to talk to somebody who's not dropping a class  i have to say. you don't know how many dropped just like_ that's good [S1: alright ] okay [S1: thanks a lot Sue ] bye-bye next... you're Marianne?
S3: yeah. hi
S2: hi i'm Sue
S3: hi
S2: um did she hit you out there?
"""

dd = defaultdict(list)
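# ^ with re.MULTILINE anchors the speaker label to the start of each line,
# so inline mentions such as "[S1: alright ]" never start a match.
rx = re.compile(r'^(S\d+):\s+(.+)', re.MULTILINE)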

for match in rx.finditer(text):
    key, value = match.groups()
    dd[key].append(value)

print(dd)

Which yields

defaultdict(<class 'list'>, 
    {'S1': ["that's (your schedule.)", 'thanks a lot Sue i think i feel much better now.'], 
     'S2': ['okay', "so it's really nice to talk to somebody who's not dropping a class  i have to say. you don't know how many dropped just like_ that's good [S1: alright ] okay [S1: thanks a lot Sue ] bye-bye next... you're Marianne?", "hi i'm Sue", 'um did she hit you out there?'], 
     'S3': ['yeah. hi', 'hi']})
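
The pattern above captures each utterance only up to the end of its line. If a transcript wraps a long utterance across several lines (an assumption; the sample in the question does not), a lookahead can capture everything up to the next speaker label that starts a line. A minimal sketch, using a made-up sample text and the hypothetical name `by_speaker`:

```python
import re
from collections import defaultdict

# Hypothetical sample: the first utterance wraps onto a continuation line.
text = """S1: thanks a lot Sue
i think i feel much better now.
S2: okay [S1: alright ] bye-bye
S3: yeah. hi
"""

# Match a label at the start of a line, then lazily consume text (including
# newlines, via DOTALL) until the next line-initial label or the end of input.
rx = re.compile(r'^(S\d+):\s*(.*?)(?=^S\d+:|\Z)', re.MULTILINE | re.DOTALL)

by_speaker = defaultdict(list)
for speaker, utterance in rx.findall(text):
    # Collapse line breaks and runs of whitespace into single spaces.
    by_speaker[speaker].append(' '.join(utterance.split()))

print(dict(by_speaker))
```

Because the lookahead requires the label to sit at the start of a line, the inline `[S1: alright ]` stays part of S2's utterance rather than starting a new one.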

You can easily convert the dict to a DataFrame afterwards.
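
One way to do that conversion, sketched here under some assumptions: flatten the dict into one record per utterance, reusing a few column names from the question. The `rows` name and the whitespace-based word count are illustrative choices, not part of the original answer, and the `dd` below is a shortened stand-in for the one built earlier.

```python
from collections import defaultdict

# Stand-in for the `dd` built above, shortened for illustration.
dd = defaultdict(list)
dd['S1'].extend(["that's (your schedule.)",
                 'thanks a lot Sue i think i feel much better now.'])
dd['S2'].extend(['okay', "hi i'm Sue"])
dd['S3'].append('yeah. hi')

# One record per utterance. Counting words by splitting on whitespace is an
# assumption; the question does not say how words should be counted.
rows = [
    {'Speaker ID': speaker,
     'Utterance': utterance,
     '# of Words': len(utterance.split())}
    for speaker, utterances in dd.items()
    for utterance in utterances
]

for row in rows:
    print(row)

# With pandas installed, a list of dicts can be passed straight to the
# DataFrame constructor:
# import pandas as pd
# df = pd.DataFrame(rows)
```

Keeping one row per utterance (rather than one column per speaker) avoids padding, since the speakers have different numbers of utterances.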
