Reputation: 51
More specifically, I'm working with a txt file of a dialogue between a few different speakers. I'm trying to separate out the text according to speaker. The file looks like this:
S1: that's (your schedule.)
S2: okay
S1: thanks a lot Sue i think i feel much better now.
S2: so it's really nice to talk to somebody who's not dropping a class <S1: LAUGH> i have to say. you don't know how many dropped just like_ that's good <WALKS S1 TO THE DOOR>[S1: alright ] okay [S1: thanks a lot Sue ] bye-bye next... you're Marianne?
S3: yeah. hi
S2: hi i'm Sue
S3: hi
S2: um did she hit you out there?
So I've created an empty data frame to store everything that each speaker says. I want everything that S1 says to be stored separately from everything S2 says, and so on.
My problem is I don't know how to capture the text between each speaker ID. So, for example, when it sees 'S1:', it will capture all the text until it sees 'S2:'. Here is what I'm currently working with. The result is 'None'.
import re
import pandas as pd
df=pd.DataFrame(columns=['Speaker ID', 'Role', 'Gender', 'Utterance', '# of Words', '# of Pronouns', '1st P SG', '1st P PL', '2nd P'])
#print (df)
f=open('Academic_Advising.txt').read()
try:
    pattern=re.search(r'^S\d+', f).group()
except AttributeError:
    pattern=None
print (pattern)
Sorry if this is long-winded. I'm just trying to be clear.
Upvotes: 1
Views: 55
Reputation: 43169
Imo, use a defaultdict in combination with a simple regular expression:
from collections import defaultdict
import re
text = """
S1: that's (your schedule.)
S2: okay
S1: thanks a lot Sue i think i feel much better now.
S2: so it's really nice to talk to somebody who's not dropping a class <S1: LAUGH> i have to say. you don't know how many dropped just like_ that's good <WALKS S1 TO THE DOOR>[S1: alright ] okay [S1: thanks a lot Sue ] bye-bye next... you're Marianne?
S3: yeah. hi
S2: hi i'm Sue
S3: hi
S2: um did she hit you out there?
"""
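dd = defaultdict(list)
# Anchor at line start (^ with re.MULTILINE) so inline tags such as <S1: LAUGH> are never treated as a new speaker.
rx = re.compile(r'^(S\d+):\s+(.+)', re.MULTILINE)
for match in rx.finditer(text):
    key, value = match.groups()
    dd[key].append(value)
print(dd)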
Which yields
defaultdict(<class 'list'>,
{'S1': ["that's (your schedule.)", 'thanks a lot Sue i think i feel much better now.'],
'S2': ['okay', "so it's really nice to talk to somebody who's not dropping a class <S1: LAUGH> i have to say. you don't know how many dropped just like_ that's good <WALKS S1 TO THE DOOR>[S1: alright ] okay [S1: thanks a lot Sue ] bye-bye next... you're Marianne?", "hi i'm Sue", 'um did she hit you out there?'],
'S3': ['yeah. hi', 'hi']})
You can easily transfer a dict to a dataframe afterwards.
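One possible way to do that transfer, as a minimal sketch: flatten the dict into one row per utterance. The column names `Speaker ID`, `Utterance` and `# of Words` are borrowed from the question's data frame, and the shortened `dd` literal is only a stand-in for the result above.

```python
from collections import defaultdict

import pandas as pd

# Stand-in for the defaultdict built above (shortened for illustration).
dd = defaultdict(list, {
    "S1": ["that's (your schedule.)", "thanks a lot Sue i think i feel much better now."],
    "S2": ["okay", "hi i'm Sue"],
    "S3": ["yeah. hi", "hi"],
})

# One row per utterance: pair every speaker ID with each of its utterances.
rows = [
    (speaker, utterance)
    for speaker, utterances in dd.items()
    for utterance in utterances
]
df = pd.DataFrame(rows, columns=["Speaker ID", "Utterance"])

# Word count per utterance, split on whitespace.
df["# of Words"] = df["Utterance"].str.split().str.len()
print(df)
```

Note that the rows come out grouped by speaker rather than in the original conversation order, because the dict has already collected the utterances per speaker.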
Upvotes: 1
Reputation: 607
Maybe this can help you on your journey:
import re
import pandas as pd
def main() -> pd.DataFrame:
    s = """S1: that's (your schedule.)
S2: okay
S1: thanks a lot Sue i think i feel much better now.
S2: so it's really nice to talk to somebody who's not dropping a class <S1: LAUGH> i have to say. you don't know how many dropped just like_ that's good <WALKS S1 TO THE DOOR>[S1: alright ] okay [S1: thanks a lot Sue ] bye-bye next... you're Marianne?
S3: yeah. hi
S2: hi i'm Sue
S3: hi
S2: um did she hit you out there?"""
    data_rows = []
    # Split only on speaker IDs at the start of a line, so inline tags
    # such as <S1: LAUGH> or [S1: alright ] stay inside the utterance.
    splitted = re.split(r"^(S\d+):", s, flags=re.MULTILINE)
    # splitted looks like ['', 'S1', " that's ...\n", 'S2', ...]:
    # odd indices hold the speaker ID, the following item holds the text.
    for i in range(1, len(splitted), 2):
        speaker_id = splitted[i]
        message = splitted[i + 1].strip()
        word_count = len(message.split())  # count words, not characters
        data_rows.append([speaker_id, message, word_count])
    df = pd.DataFrame(data_rows, columns=["SpeakerId", "Message", "WordCount"])
    print(df.head())
    return df
if __name__ == '__main__':
    main()
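If the goal is one block of text per speaker, a frame with one row per message can be collapsed with `groupby`. This is only a sketch: the small `df` below is rebuilt by hand from a few transcript lines in the question, and the column names `SpeakerId` and `Message` are assumptions.

```python
import pandas as pd

# Sample rows, one per message, taken from the transcript in the question.
df = pd.DataFrame(
    [
        ["S1", "that's (your schedule.)"],
        ["S2", "okay"],
        ["S1", "thanks a lot Sue i think i feel much better now."],
        ["S3", "yeah. hi"],
        ["S2", "hi i'm Sue"],
        ["S3", "hi"],
    ],
    columns=["SpeakerId", "Message"],
)

# Join each speaker's messages, in conversation order, into a single string.
per_speaker = df.groupby("SpeakerId")["Message"].agg(" ".join)
print(per_speaker)
```

The result is a Series indexed by speaker ID, so `per_speaker["S1"]` gives everything S1 said.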
Upvotes: 0