Reputation: 51
Here is part of the txt file I'm working with:
<PERSON ID="S1" LANG="NRN" ROLE="JU, Student" SEX="M" RESTRICT="NONE" AGE="1">
<FIRSTLANG> HIN </FIRSTLANG>
</PERSON>
<PERSON ID="S2" LANG="NS" ROLE="ST, Advisor" SEX="F" RESTRICT="NONE" AGE="3"/>
<PERSON ID="S3" LANG="NS" ROLE="JU, Student" SEX="F" RESTRICT="NONE" AGE="1"/>
<PERSON ID="S4" LANG="NS" ROLE="JU, Student" SEX="M" RESTRICT="NONE" AGE="1"/>
<PERSON ID="S5" LANG="NS" ROLE="JU, Student" SEX="F" RESTRICT="NONE" AGE="1"/>
<PERSON ID="S6" LANG="NS" ROLE="JU, Student" SEX="F" RESTRICT="NONE" AGE="1"/>
<PERSON ID="S7" LANG="NS" ROLE="UN, Unknown" SEX="F" RESTRICT="NONE" AGE="1"/>
<PERSON ID="R1" LANG="NS" ROLE="ST, MICASE Researcher" SEX="F" RESTRICT="NONE" AGE="1"/>
<PERSON ID="SU-f" LANG="NS" ROLE="UN, Unknown" SEX="F" RESTRICT="NONE" AGE="0"/>
<PERSON ID="SS" LANG="NS" ROLE="UN, Unkown" SEX="F" RESTRICT="NONE" AGE="0"/>
And here is the dataframe I have that I want to add columns of information to:
speaker_id utterances #_of_words \
0 S1 [alright, sue, now, it, s, like, uh, i, droppe... 2570
1 S2 [this, year, this, term, ri, oh, but, you, dro... 20475
2 S3 [yeah, hi, hi, yeah, i, already, signed, s2, o... 945
3 S4 [back, in, i, was, like, w, what, is, that, ye... 2133
4 S5 [okay, well, i, m, not, here, for, a, drop, ad... 1229
5 S6 [me, yeah, that, s, right, i, have, a, questio... 1027
6 S7 [hello, hi, what, was, your, name, i, thought,... 93
1p_sg 1p_pl 2p #_of_pronouns
0 220 6 31 257
1 575 37 1534 2146
2 102 0 12 114
3 181 11 60 252
4 120 3 17 140
5 97 1 11 109
6 6 1 3 10
I'm trying to add two columns, 'role' and 'gender', to my dataframe. I want to extract that information from the txt file I have above. As you can see, there is a speaker id that is associated with a specific role and gender. So, for example, in the the first row where it says S1, I would want the 'role' column to say "JU, Student" and the 'gender' column to say "M" since those are the corresponding role and gender. I already have the following commands to compile that info:
role=re.compile('ROLE="(.+?)"')
gender=re.compile('SEX="(.+?)"')
I just don't know how to get it to the corresponding row in the dataframe. How do I do this?
Upvotes: 0
Views: 49
Reputation: 1377
You were very close but this is how you would do it using regex:
xml_file = """<PERSON ID="S1" LANG="NRN" ROLE="JU, Student" SEX="M" RESTRICT="NONE" AGE="1">
<FIRSTLANG> HIN </FIRSTLANG>
</PERSON>
<PERSON ID="S2" LANG="NS" ROLE="ST, Advisor" SEX="F" RESTRICT="NONE" AGE="3"/>
<PERSON ID="S3" LANG="NS" ROLE="JU, Student" SEX="F" RESTRICT="NONE" AGE="1"/>
<PERSON ID="S4" LANG="NS" ROLE="JU, Student" SEX="M" RESTRICT="NONE" AGE="1"/>
<PERSON ID="S5" LANG="NS" ROLE="JU, Student" SEX="F" RESTRICT="NONE" AGE="1"/>
<PERSON ID="S6" LANG="NS" ROLE="JU, Student" SEX="F" RESTRICT="NONE" AGE="1"/>
<PERSON ID="S7" LANG="NS" ROLE="UN, Unknown" SEX="F" RESTRICT="NONE" AGE="1"/>
<PERSON ID="R1" LANG="NS" ROLE="ST, MICASE Researcher" SEX="F" RESTRICT="NONE" AGE="1"/>
<PERSON ID="SU-f" LANG="NS" ROLE="UN, Unknown" SEX="F" RESTRICT="NONE" AGE="0"/>
<PERSON ID="SS" LANG="NS" ROLE="UN, Unkown" SEX="F" RESTRICT="NONE" AGE="0"/>"""
person_list = re.findall(r'(ID=\".+\")\s(LANG=\".+\")\s(ROLE=\".+\")\s(SEX=\".+\")\s(RESTRICT=\".+\")\s(AGE=\".+\")', xml_file)
df2 = pd.DataFrame([{x.split('=')[0] : x.split('=')[1].replace('"', '') for x in person} for person in person_list])
# df2 = df2[['ID', 'ROLE', 'GENDER']] #Choose which columns you want to keep.
df_merge = pd.merge(left=df1, right=df2, how='left', right_on='ID', left_on='speaker_id')
Let me know if that works for you.
Upvotes: 1