Reputation: 563
I have a large number of XML files, ~3000. Each XML file contains a single user's tweets, and the file name is the user ID. I want to create a pandas DataFrame with 3000 rows and two columns: one column is user_id, and the other is the user's tweets.
I was able to extract the content of one sample XML file and save it in a list.
import xml.etree.ElementTree as ET

# parse the xml file
mytree = ET.parse('1a6446f74e20c558a2cef325394499.xml')
myroot = mytree.getroot()
tweet_list = [element.text for element in myroot.iter('document')]
Sample of my XML:
<author lang="en">
<documents>
<document><![CDATA[I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive... ]]></document>
<document><![CDATA[Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal.]]></document>
<document><![CDATA[happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M ]]></document>
</documents>
</author>
I want to apply this code to all my XML files in one directory, and then convert the lists into rows of a DataFrame.
I tried this code to get the content of the files, but I can't get past myroot:
import os

path = './data'
for filename in os.listdir(path):
    if not filename.endswith('.xml'):
        continue
    fullname = os.path.join(path, filename)
    #print(fullname)
    mytree = ET.parse(fullname)
    myroot = mytree.getroot()
Any tip would be helpful.
Upvotes: 0
Views: 1210
Reputation: 11
I've created a package because I have a similar use case.
pip install pandas_read_xml
Here is how you might use it. Say all the XML files are in an authors.zip file.
import pandas_read_xml as pdx
df = pdx.read_xml('authors.zip')
That said, this XML format is not one I anticipated, so you may need to check what it produces.
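If the package's defaults don't fit this schema, a plain-stdlib fallback that reads the XML files straight out of the zip could look like the sketch below. It assumes one .xml member per user, named by user ID (the function name and archive layout are illustrative, not from the original post):

```python
import zipfile
import xml.etree.ElementTree as ET
import pandas as pd

def tweets_from_zip(zip_path):
    """Build one row per tweet, reading each .xml member straight from the archive."""
    rows = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if not name.endswith('.xml'):
                continue
            root = ET.fromstring(zf.read(name))
            # file name (without directories or extension) is the user id
            user_id = name.rsplit('/', 1)[-1][:-len('.xml')]
            for doc in root.iter('document'):
                rows.append({'user_id': user_id, 'tweets': doc.text})
    return pd.DataFrame(rows)
```

This avoids extracting the archive to disk; ET.fromstring happily parses the raw bytes returned by zf.read.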
Upvotes: 1
Reputation: 62393
Use Path.rglob from the pathlib module to find all the files; the user_id comes from the file name (file.stem).
Tested with files test_00.xml, test_01.xml, and test_02.xml, in a directory named xml.
Option 1 - 1 row per tweet
2.94 s ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) for 4640 files, each containing 3 tweets.
from pathlib import Path
import xml.etree.ElementTree as ET
import pandas as pd
# path to top directory
p = Path('xml')
# find all files
files = p.rglob('*.xml')
# create dataframe
df_list = list()
for file in files:
    mytree = ET.parse(file)
    myroot = mytree.getroot()
    tweet_list = [element.text for element in myroot.iter('document')]
    df_list.append(pd.DataFrame({'user_id': file.stem, 'tweets': tweet_list}))
df = pd.concat(df_list).reset_index(drop=True)
Output 1
user_id tweets
test_00 I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive...
test_00 Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal.
test_00 happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M
test_01 I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive...
test_01 Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal.
test_01 happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M
test_02 I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive...
test_02 Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal.
test_02 happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M
Option 2 - 1 row per user_id
p = Path('xml')
files = p.rglob('*.xml')
df_list = list()
for file in files:
    mytree = ET.parse(file)
    myroot = mytree.getroot()
    # note the extra brackets: the whole tweet list becomes one cell
    tweet_list = [[element.text for element in myroot.iter('document')]]
    df_list.append(pd.DataFrame({'user_id': file.stem, 'tweets': tweet_list}))
df = pd.concat(df_list).reset_index(drop=True)
Output 2
user_id tweets
test_00 [I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive... , Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal., happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M ]
test_01 [I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive... , Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal., happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M ]
test_02 [I see my page views are up 433% of late....now that definitely has my attention.To all you lovely and supportive... , Howdy fans - I've moved another 35 spots today,up the "Global Reverbnation Country Charts",getting closer to my goal., happy Memorial Day weekend this is a song I wrote for the veterans that suffer from p,d,s,d - Watch on N1M ]
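As an aside, a list column like Output 2 can be unpacked back into one row per tweet (Output 1's shape) with DataFrame.explode; a minimal sketch on hand-made data (values here are illustrative; ignore_index needs pandas ≥ 1.1):

```python
import pandas as pd

# one row per user, tweets stored as a list (the Output 2 shape)
df = pd.DataFrame({
    'user_id': ['test_00', 'test_01'],
    'tweets': [['tweet a', 'tweet b'], ['tweet c']],
})

# explode puts each list element on its own row, repeating user_id
df_long = df.explode('tweets', ignore_index=True)
```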
Option 3 - collections.defaultdict - 1 row per tweet
df is the same as Output 1.
806 ms ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) for 4640 files, each containing 3 tweets.
from collections import defaultdict
from pathlib import Path
import xml.etree.ElementTree as ET
import pandas as pd
# path to top directory
p = Path('xml')
# find all files
files = p.rglob('*.xml')
box = defaultdict(list)
for file in files:
    root = ET.parse(file).getroot()
    for element in root.iter("document"):
        box[file.stem].append(element.text)
# get the final data into a dataframe
# use T (transpose) and stack
df = pd.DataFrame(pd.DataFrame(box).T.stack()).reset_index(level=0).reset_index(drop=True).rename(columns={'level_0': 'user_id', 0: 'tweets'})
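To see what the T/stack chain is doing, here is the same reshaping on a tiny hand-made dict (keys and values here are illustrative):

```python
import pandas as pd

# keys are user ids, values are equal-length lists of tweets
box = {'u1': ['a', 'b'], 'u2': ['c', 'd']}

# pd.DataFrame(box) makes one column per user; .T puts user ids on the
# index; .stack() melts the tweet columns into one long series
s = pd.DataFrame(box).T.stack()

# move the user id index level into a column and rename
df = (pd.DataFrame(s)
        .reset_index(level=0)
        .reset_index(drop=True)
        .rename(columns={'level_0': 'user_id', 0: 'tweets'}))
```

Note that pd.DataFrame(box) only works directly because every user has the same number of tweets; the defaultdict loop above happens to guarantee that only when all files hold equally many document elements.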
Option 4 - collections.defaultdict - 1 row per user_id
df is the same as Output 2.
p = Path('xml')
files = p.rglob('*.xml')
box = defaultdict(list)
for file in files:
    root = ET.parse(file).getroot()
    box[file.stem].append([element.text for element in root.iter('document')])
df = pd.DataFrame(pd.DataFrame(box).T.stack()).reset_index(level=0).reset_index(drop=True).rename(columns={'level_0': 'user_id', 0: 'tweets'})
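To try any of these snippets without the real data, you can first generate a small xml/ directory of sample files like the ones used for timing (file names and contents here are illustrative):

```python
from pathlib import Path

# minimal stand-in for one user's file, matching the <author>/<documents> schema
sample = """<author lang="en">
  <documents>
    <document><![CDATA[first tweet]]></document>
    <document><![CDATA[second tweet]]></document>
  </documents>
</author>"""

p = Path('xml')
p.mkdir(exist_ok=True)
for i in range(3):
    # test_00.xml, test_01.xml, test_02.xml - the stem plays the user_id role
    (p / f'test_{i:02d}.xml').write_text(sample, encoding='utf-8')
```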
Upvotes: 1