Sampath Rajapaksha

Reputation: 101

Convert XML to a pandas DataFrame

I want to convert the NPS Chat corpus into a pandas DataFrame. There are 15 XML files inside nps_chat:

10-19-20s_706posts.xml, 10-19-30s_705posts.xml, 10-19-40s_686posts.xml, 10-19-adults_706posts.xml, 10-24-40s_706posts.xml, 10-26-teens_706posts.xml, 11-06-adults_706posts.xml, 11-08-20s_705posts.xml, 11-08-40s_706posts.xml, 11-08-adults_705posts.xml, 11-08-teens_706posts.xml, 11-09-20s_706posts.xml, 11-09-40s_706posts.xml, 11-09-adults_706posts.xml, 11-09-teens_706posts.xml

I want to load all of them into a single DataFrame.

Here is a sample post from the corpus:

<!-- edited with XMLSpy v2007 sp1 (http://www.altova.com) by Eric Forsyth (Naval Postgraduate School) -->
<Session xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="postClassPOSTagset.xsd">
    <Posts>
        <Post class="Statement" user="10-19-20sUser7">now im left with this gay name<terminals>
                <t pos="RB" word="now"/>
                <t pos="PRP" word="im"/>
                <t pos="VBD" word="left"/>
                <t pos="IN" word="with"/>
                <t pos="DT" word="this"/>
                <t pos="JJ" word="gay"/>
                <t pos="NN" word="name"/>
            </terminals>
        </Post>
        <Post class="Emotion" user="10-19-20sUser7">:P<terminals>
                <t pos="UH" word=":P"/>
            </terminals>
        </Post>
        <Post class="System" user="10-19-20sUser76">PART<terminals>
                <t pos="VB" word="PART"/>
            </terminals>

From this I only need each post's class and its text in the DataFrame, for example:

      Class         text
1     Statement     now im left with this gay name
2     Emotion       :P
3     System        PART

I could get the text into pandas using the code below:

from nltk.corpus import nps_chat as nps
import pandas as pd
import numpy as np
chatroom = nps.posts()
df = pd.DataFrame(np.array(chatroom),columns=["text"]) 


Is there any way to get the class as well? That is the only missing part.

Upvotes: 1

Views: 1219

Answers (1)

Elvin Valiev

Reputation: 454

What about something like this?

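import pandas as pd
from nltk.corpus import nps_chat

# Collect one record per post: its class attribute and its text.
data = []
for p in nps_chat.xml_posts():
    data.append({"class": p.get("class"), "text": p.text})

# A list of dicts maps directly onto DataFrame rows.
df = pd.DataFrame(data)
df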
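If you want to see why `p.get("class")` and `p.text` are enough, here is a minimal sketch that does the same extraction with only the standard library. It assumes the posts behave like ordinary ElementTree elements, which is what the loop above relies on. `SAMPLE` is a shortened, closed-off copy of the snippet from the question, and `posts_to_rows` is a hypothetical helper name, not part of NLTK.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample from the question, with the closing tags
# added so that it parses on its own.
SAMPLE = """<Session>
    <Posts>
        <Post class="Statement" user="10-19-20sUser7">now im left with this gay name<terminals>
                <t pos="RB" word="now"/>
            </terminals>
        </Post>
        <Post class="Emotion" user="10-19-20sUser7">:P<terminals>
                <t pos="UH" word=":P"/>
            </terminals>
        </Post>
        <Post class="System" user="10-19-20sUser76">PART<terminals>
                <t pos="VB" word="PART"/>
            </terminals>
        </Post>
    </Posts>
</Session>"""


def posts_to_rows(xml_text):
    """Return one {"class", "text"} dict per <Post> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rows = []
    for post in root.iter("Post"):
        rows.append({
            # "class" is an XML attribute on <Post>.
            "class": post.get("class"),
            # .text is the text before the first child (<terminals>);
            # it can be None for an empty post, hence the fallback.
            "text": (post.text or "").strip(),
        })
    return rows


rows = posts_to_rows(SAMPLE)
for row in rows:
    print(row["class"], "|", row["text"])
```

Passing `rows` to `pd.DataFrame(rows)` should then give the two-column layout asked for in the question.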


Upvotes: 2
