leena

Reputation: 563

Converting XML to a pandas DataFrame

I have a large number of XML files, ~3000. Each XML file contains a single user's tweets, and the file name is the user ID. I want to create a pandas DataFrame with 3000 rows and two columns: one column is user_id, and the other is the user's tweets.

I was able to extract the content of one sample XML file and save it in a list.

import xml.etree.ElementTree as ET

# parse the xml file
mytree = ET.parse('1a6446f74e20c558a2cef325394499.xml')
myroot = mytree.getroot()

# collect the text of every <document> element
tweet_list = [element.text for element in myroot.iter('document')]

Sample of my XML:

<author lang="en">
    <documents>
        <document><![CDATA[I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive... ]]></document>
        <document><![CDATA[Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal.]]></document>
        <document><![CDATA[happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M ]]></document>
    </documents>
</author>
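
For this sample, tweet_list ends up holding one string per document element:

# contents of tweet_list for the sample file above
['I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive... ',
 'Howdy fans - I\'ve moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal.',
 'happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M ']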

I want to apply this code to all the XML files located in one directory, and then convert the lists into rows of a dataframe.

I tried this code to get the content of the files, but I can't get past myroot:

import os
import xml.etree.ElementTree as ET

path = './data'

for filename in os.listdir(path):
    if not filename.endswith('.xml'):
        continue
    fullname = os.path.join(path, filename)
    #print(fullname)
    mytree = ET.parse(fullname)
    myroot = mytree.getroot()

Any tip would be helpful.

Upvotes: 0

Views: 1210

Answers (2)

min

Reputation: 11

I've created a package because I have a similar use case.

pip install pandas_read_xml

Here is how you might use it. Say all the XML files are in an authors.zip file.

import pandas_read_xml as pdx

df = pdx.read_xml('authors.zip')

That XML format is not one I anticipated, though, so you may need to check what it does with it.

Upvotes: 1

Trenton McKinney

Reputation: 62393

  • The following code will use Path.rglob from the pathlib module to find all files
  • This will create a single dataframe with tweets from all user_id files
  • As an example, with your sample data in three files named test_00.xml, test_01.xml, and test_02.xml, in a directory named xml
  • 2.94 s ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) for 4640 files, each containing 3 tweets.

Option 1: Outputs 1 tweet per row

from pathlib import Path
import xml.etree.ElementTree as ET
import pandas as pd

# path to top directory
p = Path('xml')

# find all files
files = p.rglob('*.xml')

# create one dataframe per file and collect them in a list
df_list = list()
for file in files:
    mytree = ET.parse(file)
    myroot = mytree.getroot()
    tweet_list = [element.text for element in myroot.iter('document')]
    df_list.append(pd.DataFrame({'user_id': file.stem, 'tweets': tweet_list}))

# combine the per-user frames into a single dataframe
df = pd.concat(df_list).reset_index(drop=True)

Output 1

 user_id                                                                                                                 tweets
 test_00   I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive... 
 test_00  Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal.
 test_00            happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M 
 test_01   I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive... 
 test_01  Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal.
 test_01            happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M 
 test_02   I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive... 
 test_02  Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal.
 test_02            happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M 

Option 2: Outputs 1 row per user_id

p = Path('xml')
files = p.rglob('*.xml')

df_list = list()
for file in files:
    mytree = ET.parse(file)
    myroot = mytree.getroot()
    # nest the list so all of a user's tweets land in a single row
    tweet_list = [[element.text for element in myroot.iter('document')]]
    df_list.append(pd.DataFrame({'user_id': file.stem, 'tweets': tweet_list}))

df = pd.concat(df_list).reset_index(drop=True)

Output 2

 user_id                                                                                                                                                                                                                                                                                                                                                      tweets
 test_00  [I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive... , Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal., happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M ]
 test_01  [I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive... , Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal., happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M ]
 test_02  [I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive... , Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal., happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M ]
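
  • If you start from this one-row-per-user frame and later need 1 tweet per row, DataFrame.explode (available since pandas 0.25) converts Output 2 into the shape of Output 1:

# expand the list column so each tweet gets its own row
df_long = df.explode('tweets').reset_index(drop=True)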

Option 3: Using collections.defaultdict - 1 row per tweet

  • This solution is suggested by sammywemmy
  • The output df is the same as Output 1
  • 806 ms ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) for 4640 files, each containing 3 tweets.

from collections import defaultdict
from pathlib import Path
import xml.etree.ElementTree as ET
import pandas as pd

# path to top directory
p = Path('xml')

# find all files
files = p.rglob('*.xml')

box = defaultdict(list)
for file in files:
    root = ET.parse(file).getroot()
    for element in root.iter("document"):
        box[file.stem].append(element.text)

# get the final data into a dataframe
# use T (transpose) and stack
df = pd.DataFrame(pd.DataFrame(box).T.stack()).reset_index(level=0).reset_index(drop=True).rename(columns={'level_0': 'user_id', 0: 'tweets'})
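
  • For readability, here is the same pipeline spelled out step by step. Note that pd.DataFrame(box) assumes every file yields the same number of tweets (as in the timing test above); a dict of unequal-length lists raises a ValueError.

wide = pd.DataFrame(box)       # columns: one per user_id; rows: tweet index
stacked = wide.T.stack()       # Series with MultiIndex (user_id, tweet index)
df = (
    pd.DataFrame(stacked)
    .reset_index(level=0)      # the unnamed user_id level becomes column 'level_0'
    .reset_index(drop=True)
    .rename(columns={'level_0': 'user_id', 0: 'tweets'})
)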

Option 4: Using collections.defaultdict - 1 row per user_id

  • The output df is the same as Output 2

p = Path('xml')
files = p.rglob('*.xml')

box = defaultdict(list)
for file in files:
    root = ET.parse(file).getroot()
    box[file.stem].append([element.text for element in root.iter('document')])

df = pd.DataFrame(pd.DataFrame(box).T.stack()).reset_index(level=0).reset_index(drop=True).rename(columns={'level_0': 'user_id', 0: 'tweets'})
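
  • Since each value in box holds a single nested list here, an equivalent direct construction (a simplification, not timed above) avoids the transpose/stack step:

df = pd.DataFrame({'user_id': list(box), 'tweets': [v[0] for v in box.values()]})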

Upvotes: 1
