pylearner

Reputation: 1460

Converting xml data to a dataframe

How can I convert the XML data below into a DataFrame with the format shown under "Expected DataFrame"?

<start>
    <main index = '1', sub = 'english' >
        <name value = '1', text = 'hi this is xxx' />
        <name value = '2', text = 'isnt this funny' />
    </main>
    <main index = '2', sub = 'french'>
        <name value = '1', text = 'Comment vas-tu' />
        <name value = '2', text = 'sil vous plaît résoudre ce'>
    </main>
</start>

Expected DataFrame:

mainindex           namevalue           text
A                       1               hi this is xxx
A                       2               isnt this funny
B                       1               Comment vas-tu
B                       2               sil vous plaît résoudre ce

Upvotes: 0

Views: 92

Answers (2)

Heaven

Reputation: 536

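Another method: read the file line by line, split each line on the single quotes, and write the fields to a text file.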

saveFileName = 'yourOwnFileName.txt'

def main():
    mainindex = None  # index of the <main> element currently being read

    # Assumes the input file is UTF-8 encoded, so the accented French text survives
    with open('yourOwnXml.xml', 'r', encoding='utf-8') as f_read:
        with open(saveFileName, 'w', encoding='utf-8') as f_write:
            for line in f_read:
                # Splitting on single quotes leaves the attribute values at the odd positions
                if '<main index' in line:
                    mainindex = line.split('\'')[1]
                if '<name value' in line:
                    name_value = line.split('\'')[1]
                    text = line.split('\'')[3]
                    f_write.write('{mainindex} {namevalue} {text}\n'.format(mainindex=mainindex, namevalue=name_value, text=text))

if __name__ == '__main__':
    main()

The output in yourOwnFileName.txt should be:

1 1 hi this is xxx
1 2 isnt this funny
2 1 Comment vas-tu
2 2 sil vous plaît résoudre ce
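
Note that this output carries the numeric index (1, 2), whereas the expected DataFrame in the question labels the groups A and B. A minimal sketch of the same line-by-line idea, using regular expressions and mapping the index to a letter, might look like the following. The names SAMPLE, MAIN_RE, NAME_RE and extract_rows are made up for illustration, and the index-to-letter mapping (1 becomes A, 2 becomes B) is an assumption about what the question intends.

```python
import re

# Sample input copied from the question (it is not strictly well-formed XML).
SAMPLE = """<start>
    <main index = '1', sub = 'english' >
        <name value = '1', text = 'hi this is xxx' />
        <name value = '2', text = 'isnt this funny' />
    </main>
    <main index = '2', sub = 'french'>
        <name value = '1', text = 'Comment vas-tu' />
        <name value = '2', text = 'sil vous plaît résoudre ce'>
    </main>
</start>"""

# Hypothetical patterns: they only cover the attribute layout shown above.
MAIN_RE = re.compile(r"<main\s+index\s*=\s*'([^']*)'")
NAME_RE = re.compile(r"<name\s+value\s*=\s*'([^']*)'\s*,?\s*text\s*=\s*'([^']*)'")


def extract_rows(xml_text):
    """Return (label, value, text) tuples, one per <name> line."""
    rows = []
    label = None
    for line in xml_text.splitlines():
        main = MAIN_RE.search(line)
        if main:
            # Assumption: index '1' maps to 'A', '2' to 'B', and so on.
            label = chr(ord("A") + int(main.group(1)) - 1)
            continue
        name = NAME_RE.search(line)
        if name:
            rows.append((label, name.group(1), name.group(2)))
    return rows


rows = extract_rows(SAMPLE)
for row in rows:
    print(*row)
```

If pandas is available, the list can then be handed to pd.DataFrame(rows, columns=['mainindex', 'namevalue', 'text']) to get the frame the question asks for.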

Upvotes: 1

iamklaus

Reputation: 3770

You could use something like BeautifulSoup:

from bs4 import BeautifulSoup
import pandas as pd

data = """<start>
    <main index = '1', sub = 'english' >
        <name value = '1', text = 'hi this is xxx' />
        <name value = '2', text = 'isnt this funny' />
    </main>
    <main index = '2', sub = 'french'>
        <name value = '1', text = 'Comment vas-tu' />
        <name value = '2', text = 'sil vous plaît résoudre ce'>
    </main>
</start>"""

# Name the parser explicitly; html.parser tolerates the commas and the unclosed <name> tag
soup = BeautifulSoup(data, 'html.parser')

headers = ['mainIndex', 'nameValue', 'text']

rows = []
for i, m in enumerate(soup.find_all('main')):
    label = chr(ord('A') + i)  # first <main> -> 'A', second -> 'B', ...
    for name in m.find_all('name'):
        rows.append([label, name.get('value'), name.get('text')])

# Build the frame once instead of growing it row by row
dataframe = pd.DataFrame(rows, columns=headers)

print(dataframe)

  mainIndex nameValue                        text
0         A         1              hi this is xxx
1         A         2             isnt this funny
2         B         1              Comment vas-tu
3         B         2  sil vous plaît résoudre ce
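
A lenient parser is needed here because the sample is not well-formed XML: the attributes are separated by commas and the last name element is never closed. If the input can be cleaned up first, the standard library's xml.etree.ElementTree is an option. The sketch below assumes a corrected copy of the data (WELL_FORMED is hand-edited, not the original input), and rows_from_xml is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumption: a hand-corrected, well-formed version of the question's data
# (commas between attributes removed, last <name> element closed).
WELL_FORMED = """<start>
    <main index='1' sub='english'>
        <name value='1' text='hi this is xxx' />
        <name value='2' text='isnt this funny' />
    </main>
    <main index='2' sub='french'>
        <name value='1' text='Comment vas-tu' />
        <name value='2' text='sil vous plaît résoudre ce' />
    </main>
</start>"""


def rows_from_xml(xml_text):
    """Return one dict per <name> element, labelled by its parent <main>."""
    root = ET.fromstring(xml_text)
    rows = []
    for i, main in enumerate(root.iter("main")):
        label = chr(ord("A") + i)  # first <main> -> 'A', second -> 'B'
        for name in main.iter("name"):
            rows.append({
                "mainIndex": label,
                "nameValue": name.get("value"),
                "text": name.get("text"),
            })
    return rows


records = rows_from_xml(WELL_FORMED)
for record in records:
    print(record)
```

Passing the list of dicts to pd.DataFrame(records) should give the same three columns as above. On the original, uncorrected string, ET.fromstring raises a ParseError, which is why this only applies once the input has been fixed.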

Upvotes: 0
