Kenna Reagan

Reputation: 51

How do I capture everything after a pattern until I find another instance of the pattern in my text?

More specifically, I'm working with a txt file of a dialogue between a few different speakers. I'm trying to separate out the text according to speaker. The file looks like this:

    S1: that's (your schedule.)
    S2: okay
    S1: thanks a lot Sue i think i feel much better now.
    S2: so it's really nice to talk to somebody who's not dropping a class <S1: LAUGH> i have to say. you don't know how many dropped just like_ that's good <WALKS S1 TO THE DOOR>[S1: alright ] okay [S1: thanks a lot Sue ] bye-bye next... you're Marianne?
    S3: yeah. hi
    S2: hi i'm Sue
    S3: hi
    S2: um did she hit you out there?

So I've created an empty data frame to store everything that each speaker says. I want everything that S1 says to be stored separately from everything S2 says, and so on.

My problem is I don't know how to capture the text between each speaker ID. So, for example, when it sees 'S1:', it should capture all the text until it sees the next speaker ID, such as 'S2:'. Here is what I'm currently working with. The result is 'None'.

df=pd.DataFrame(columns=['Speaker ID', 'Role', 'Genger', 'Utterance', '# of Words', '# of Pronouns', '1st P SG', '1st P PL', '2nd P'])
#print (df)
f=open('Academic_Advising.txt').read()
try:
    pattern=re.search('^S\d+', f).group()
except: AttributeError
pattern=None
print (pattern)

Sorry if this is long-winded. I'm just trying to be clear.

Upvotes: 1

Views: 55

Answers (2)

Jan

Reputation: 43169

Imo, use a defaultdict in combination with a simple regular expression:

from collections import defaultdict
import re

text = """
S1: that's (your schedule.)
S2: okay
S1: thanks a lot Sue i think i feel much better now.
S2: so it's really nice to talk to somebody who's not dropping a class <S1: LAUGH> i have to say. you don't know how many dropped just like_ that's good <WALKS S1 TO THE DOOR>[S1: alright ] okay [S1: thanks a lot Sue ] bye-bye next... you're Marianne?
S3: yeah. hi
S2: hi i'm Sue
S3: hi
S2: um did she hit you out there?
"""

dd = defaultdict(list)
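# ^ with re.MULTILINE anchors the speaker ID to the start of a line,
# so inline tags such as "<S1: LAUGH>" can never start a match.
rx = re.compile(r'^(S\d+):\s+(.+)', re.MULTILINE)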

for match in rx.finditer(text):
    key, value = match.groups()
    dd[key].append(value)

print(dd)

Which yields

defaultdict(<class 'list'>, 
    {'S1': ["that's (your schedule.)", 'thanks a lot Sue i think i feel much better now.'], 
     'S2': ['okay', "so it's really nice to talk to somebody who's not dropping a class <S1: LAUGH> i have to say. you don't know how many dropped just like_ that's good <WALKS S1 TO THE DOOR>[S1: alright ] okay [S1: thanks a lot Sue ] bye-bye next... you're Marianne?", "hi i'm Sue", 'um did she hit you out there?'], 
     'S3': ['yeah. hi', 'hi']})
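
Note that `(.+)` stops at the end of a line, so the pattern above relies on each utterance sitting on a single line. If an utterance can wrap over several lines, one possible variation is a lazy match that runs until a lookahead sees the next speaker ID at the start of a line, or the end of the text. This is only a sketch and has not been run against the real file; the sample text and the name `wrapped` are made up for illustration:

```python
import re
from collections import defaultdict

# Hypothetical sample in which S2's utterance wraps onto a second line.
wrapped = """S1: thanks a lot Sue
S2: so it's really nice <S1: LAUGH>
i have to say.
S3: yeah. hi
"""

# ^(S\d+):       speaker ID at the start of a line
# (.*?)          lazily capture the utterance (DOTALL lets . cross newlines)
# (?=^S\d+:|\Z)  stop right before the next line-initial speaker ID or the end
rx = re.compile(r'^(S\d+):\s*(.*?)(?=^S\d+:|\Z)', re.MULTILINE | re.DOTALL)

dd = defaultdict(list)
for speaker, utterance in rx.findall(wrapped):
    # Collapse the internal line breaks into single spaces.
    dd[speaker].append(' '.join(utterance.split()))

print(dict(dd))
```

Because the lookahead only accepts a speaker ID at the start of a line, the inline `<S1: LAUGH>` stays inside S2's utterance.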

You can easily convert the dict to a DataFrame afterwards.
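
For example, here is a minimal sketch of that conversion, assuming pandas is installed and that one row per utterance is what you want. The column names are borrowed from the question and the word count is a simple whitespace split; `dd` is abbreviated from the output above:

```python
import pandas as pd

# Abbreviated version of the dict built above.
dd = {
    'S1': ["that's (your schedule.)", 'thanks a lot Sue i think i feel much better now.'],
    'S2': ['okay', "hi i'm Sue"],
    'S3': ['yeah. hi', 'hi'],
}

# Flatten {speaker: [utterance, ...]} into one row per utterance.
rows = [(speaker, utterance)
        for speaker, utterances in dd.items()
        for utterance in utterances]
df = pd.DataFrame(rows, columns=['Speaker ID', 'Utterance'])

# Whitespace-based word count per utterance.
df['# of Words'] = df['Utterance'].str.split().str.len()

print(df)
```

Keeping one row per utterance leaves the remaining columns from the question (pronoun counts and so on) free to be filled in later with the same kind of column-wise operation.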

Upvotes: 1

Marvin Taschenberger

Reputation: 607

Maybe this can help you on your journey:

import re
import pandas as pd 

def main() -> pd.DataFrame: 
    s = """S1: that's (your schedule.)
S2: okay
S1: thanks a lot Sue i think i feel much better now.
S2: so it's really nice to talk to somebody who's not dropping a class <S1: LAUGH> i have to say. you don't know how many dropped just like_ that's good <WALKS S1 TO THE DOOR>[S1: alright ] okay [S1: thanks a lot Sue ] bye-bye next... you're Marianne?
S3: yeah. hi
S2: hi i'm Sue
S3: hi
S2: um did she hit you out there?"""
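    data_rows = []
    # Split only on speaker IDs at the start of a line, so inline tags such as
    # "<S1: LAUGH>" stay inside the utterance. The capture group keeps the IDs.
    parts = re.split(r"^(S\d+):", s, flags=re.MULTILINE)
    # parts[0] is whatever precedes the first ID; after that, IDs and
    # utterances alternate: [prefix, id, text, id, text, ...]
    for speaker_id, message in zip(parts[1::2], parts[2::2]):
        message = message.strip()
        word_count = len(message.split())  # count words, not characters
        data_rows.append([speaker_id, message, word_count])
    df = pd.DataFrame(data_rows, columns=["SpeakerId", "Message", "WordCount"])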

    print(df.head())
    return df 

if __name__ == '__main__':
    main()
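
If you then want everything a speaker says in one place, one option is to group the rows by speaker. This is a rough sketch, assuming a frame with one row per utterance like the one `main()` builds; the rows and column names below are a shortened, illustrative stand-in:

```python
import pandas as pd

# Shortened stand-in for a frame with one row per utterance.
df = pd.DataFrame(
    [["S1", "that's (your schedule.)"],
     ["S2", "okay"],
     ["S1", "thanks a lot Sue i think i feel much better now."],
     ["S3", "yeah. hi"],
     ["S2", "hi i'm Sue"]],
    columns=["SpeakerId", "Message"],
)

# One row per speaker: all utterances joined, plus a total word count.
per_speaker = (
    df.assign(WordCount=df["Message"].str.split().str.len())
      .groupby("SpeakerId")
      .agg(Text=("Message", " ".join), WordCount=("WordCount", "sum"))
)

print(per_speaker)
```

Utterances are joined in their original order within each speaker, since `groupby` preserves row order inside each group.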

Upvotes: 0
