How do I get text I compiled from a txt file into its corresponding row in a dataframe?

Question

Here is part of the txt file I'm working with:

HIN

And here is the dataframe I have that I want to add columns of information to:

 speaker_id                                         utterances  #_of_words  \
0         S1  [alright, sue, now, it, s, like, uh, i, droppe...        2570   
1         S2  [this, year, this, term, ri, oh, but, you, dro...       20475   
2         S3  [yeah, hi, hi, yeah, i, already, signed, s2, o...         945   
3         S4  [back, in, i, was, like, w, what, is, that, ye...        2133   
4         S5  [okay, well, i, m, not, here, for, a, drop, ad...        1229   
5         S6  [me, yeah, that, s, right, i, have, a, questio...        1027   
6         S7  [hello, hi, what, was, your, name, i, thought,...          93   

   1p_sg  1p_pl    2p  #_of_pronouns  
0    220      6    31            257  
1    575     37  1534           2146  
2    102      0    12            114  
3    181     11    60            252  
4    120      3    17            140  
5     97      1    11            109  
6      6      1     3             10

I'm trying to add two columns, 'role' and 'gender', to my dataframe, using information extracted from the txt file above. As you can see, each speaker id is associated with a specific role and gender. For example, in the first row, where the speaker id is S1, I would want the 'role' column to say "JU, Student" and the 'gender' column to say "M", since those are the corresponding role and gender. I already have the following patterns to extract that info:

role=re.compile('ROLE="(.+?)"')
gender=re.compile('SEX="(.+?)"')

I just don't know how to get it to the corresponding row in the dataframe. How do I do this?

Aditya · Accepted Answer

You were very close. This is one way to do it using regex and a merge:

import re
import pandas as pd

xml_file = """
 HIN 









"""

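# Each match is a tuple of six 'KEY="value"' strings, one per attribute.
person_list = re.findall(r'(ID=\".+\")\s(LANG=\".+\")\s(ROLE=\".+\")\s(SEX=\".+\")\s(RESTRICT=\".+\")\s(AGE=\".+\")', xml_file)

# Turn each tuple into a {KEY: value} dict, so df2 gets one column per attribute.
df2 = pd.DataFrame([{x.split('=')[0] : x.split('=')[1].replace('"', '') for x in person} for person in person_list])

# df2 = df2[['ID', 'ROLE', 'SEX']]  # Choose which columns you want to keep.

# df1 is the dataframe from the question; the left merge keeps all of its rows.
df_merge = pd.merge(left=df1, right=df2, how='left', right_on='ID', left_on='speaker_id')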
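
To see the whole idea end to end, here is a minimal, self-contained sketch. The sample lines are made up: the element name `PERSON` and every attribute value except S1's role and gender (taken from the question) are assumptions, and only the attribute order follows the regex above. `df1` here is a small stand-in for the question's dataframe. It also captures just the values with non-greedy groups and names the new columns 'role' and 'gender' directly, which is a variation on the approach above rather than the same code.

```python
import re
import pandas as pd

# Hypothetical header lines: the element name and most values are invented;
# only the attribute order (ID, LANG, ROLE, SEX, ...) follows the regex above.
sample_text = '''
<PERSON ID="S1" LANG="NS" ROLE="JU, Student" SEX="M" RESTRICT="NONE" AGE="1">
<PERSON ID="S2" LANG="NS" ROLE="ST, Staff" SEX="F" RESTRICT="NONE" AGE="2">
'''

# Stand-in for the question's dataframe (only two of its columns).
df1 = pd.DataFrame({"speaker_id": ["S1", "S2", "S3"],
                    "#_of_words": [2570, 20475, 945]})

# Non-greedy groups capture just the attribute values we need.
pattern = re.compile(r'ID="(.+?)"\s+LANG=".+?"\s+ROLE="(.+?)"\s+SEX="(.+?)"')
df2 = pd.DataFrame(pattern.findall(sample_text),
                   columns=["speaker_id", "role", "gender"])

# Left merge keeps every row of df1; speakers missing from the text get NaN.
df_merge = df1.merge(df2, on="speaker_id", how="left")
print(df_merge)
```

Because both frames share the key name `speaker_id` here, `on=` is enough and no duplicate `ID` column is left over after the merge.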

Let me know if that works for you.
