user3712940
user3712940

Reputation: 39

How do I parse attribute values from multiple xml files to one pandas dataframe?

I have several XML files in a folder. They are system generated, and they pop up each night. There can be anywhere from 1 to 200 of them every night. The structure is rigid and never changes. They contain more data than in the examples I've provided, but the timestamp data is sufficient to address my problem.

What I am doing is writing a script (also only the part of the script where I am facing the problem is included below) that takes the data in them and puts it into a pandas dataframe for further handling, and then deletes the files from the folder.

My XML files look like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:13:42" more_attributes="more_values"/>
    </scans>
<scan>
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:22:55" more_attributes="more_values"/>
    </scans>
<scan>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:29:27" more_attributes="more_values"/>
    </scans>
<scan>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:41:41" more_attributes="more_values"/>
    </scans>
<scan>

My script looks like this:

import os
import pandas as pd 
import xml.etree.ElementTree as et 

path = 'my\\path'
df_cols = ['timestamp']
rows = []

for filename in os.listdir(path):
    if filename.endswith('.xml'):
        fullname = os.path.join(path, filename)
        xtree = et.parse(fullname)
        xroot = xtree.getroot() 
        scans = xroot.find('scans')
        scan = scans.findall('scan')
        for n in scan:
            s_timestamp = n.attrib.get('timestamp')
                
            rows.append({'timestamp': s_timestamp})                    
            out_df = pd.DataFrame(rows, columns = df_cols)

Now if I print(s_timestamp) I get:

20200909T08:13:42
20200909T08:22:55
20200909T08:29:27
20200909T08:41:41

This is what I want my dataframe to contain after appending. But if I print(rows) I get this:

[{'timestamp': '20200909T08:13:42'}]
[{'timestamp': '20200909T08:13:42'}, {'timestamp': '20200909T08:22:55'}]
[{'timestamp': '20200909T08:13:42'}, {'timestamp': '20200909T08:22:55'}, {'timestamp': '20200909T08:29:27'}]
[{'timestamp': '20200909T08:13:42'}, {'timestamp': '20200909T08:22:55'}, {'timestamp': '20200909T08:29:27'}, {'timestamp': '20200909T08:41:41'}]

And as a result I get four results when I print(out_df) as well:

              timestamp
0     20200909T08:13:42
              timestamp
0     20200909T08:13:42
1     20200909T08:22:55
              timestamp
0     20200909T08:13:42
1     20200909T08:22:55
2     20200909T08:29:27
              timestamp
0     20200909T08:13:42
1     20200909T08:22:55
2     20200909T08:29:27
3     20200909T08:41:41

While the result I am looking for is:

              timestamp
0     20200909T08:13:42
1     20200909T08:22:55
2     20200909T08:29:27
3     20200909T08:41:41

I understand that something in the loop and append is causing this, but I just can't see why it is happening.

Upvotes: 2

Views: 1412

Answers (1)

balderman
balderman

Reputation: 23815

create the df once with the rows:

for filename in os.listdir(path):
    if filename.endswith('.xml'):
        fullname = os.path.join(path, filename)
        xtree = et.parse(fullname)
        xroot = xtree.getroot() 
        scans = xroot.find('scans')
        scan = scans.findall('scan')
        for n in scan:
            s_timestamp = n.attrib.get('timestamp')
                
            rows.append({'timestamp': s_timestamp})                    
out_df = pd.DataFrame(rows, columns = df_cols)

Upvotes: 4

Related Questions