Python XML to CSV help for novice

Question

I have a large-ish XML file. I include a scrubbed snippet of it below. The result-size below is the number of elements in the XML file that the file calls .

Question 1: How can only get a record in the output CSV file for the every of type , and not of all of the other types? I have no control over how the XML is structured. If you comment out all but the test values in my Python code, you'll see that file is outputting 5 records when I only want 1.

Question 2: What do I need to do differently in my Python to get name = i.find("name").text to return "name of Item 1"?

I think it's fair for the answer to just focus on those first two questions. I hope that getting those answers puts me on a path to solving the rest of my problem with this XML to CSV issue. Here is the big picture and a couple more issues I will need to solve, though. Point me to books, classes, whatever will help. I only have a week or so to make this work.

Desired output:

Collection item,ITEM-ID,ATTRIB-1,PERSON-TYPE-1-NAME,ATTRIB-2,PERSON-TYPE-2-NAME,RELATED-THING-1 id,RELATED-THING-2 IDs
name of Item 1,item_000001,Yes,name of person 1,Yes,name of person 2,thing_000745,"thing_000783, thing_000803"

I can read in the file and output a CSV file with my designated column names. But I can't get the name of even the first thing into the CSV.

Then there are the more complicated bits that will require functions, such as finding an ID attribute based on matching the type and name attributes, and returning more than one ID. See RELATED-THING-1-ID and RELATED-THING-2-IDs.

Here is my Python (based on https://www.geeksforgeeks.org/convert-xml-to-csv-in-python/):

# -*- coding: utf-8 -*-

# Importing the required libraries
import xml.etree.ElementTree as Xet
import pandas as pd

cols = ["Collection item", "ITEM-ID", "ATTRIB-1", "PERSON-TYPE-1-NAME" ,
        "ATTRIB-2", "PERSON-TYPE-2-NAME", "RELATED-THING-1 id",
        "RELATED-THING-2 IDs"]
rows = []

# Parsing the XML file
xmlparse = Xet.parse('sample.xml')
root = xmlparse.getroot()
for i in root:
    name = i.find("name").text
    item_id = i.find("ITEM-ID").text
    attrib_1 = i.find("ATTRIB-1").text
    p1_name = i.find("PERSON-TYPE-1-NAME.result.row.name").text
    attrib_2 = i.find("ATTRIB-2").text
    p2_name = i.find("PERSON-TYPE-2-NAME.result.row.name").text
    relat_thing1_id = i.find("country").text
    relat_thing2_ids = i.find("country").text

    rows.append({"Collection item": name,
                 "ITEM-ID": item_id,
                 "ATTRIB-1": attrib_1,
                 "PERSON-TYPE-1-NAME": p1_name,
                 "ATTRIB-2": attrib_2,
                 "PERSON-TYPE-2-NAME": p2_name,
                 "RELATED-THING-1 id": relat_thing1_id,
                 "RELATED-THING-2 IDs": relat_thing2_ids
    })

df = pd.DataFrame(rows, columns=cols)

# Writing dataframe to csv
df.to_csv('output.csv')

This is the XML:



  
    3255183
    CONTEXT
    collections
  
  
    3652889
    COLLECTION-ITEM
    name of Item 1
    item_000001
    Yes
    
      
        
          3254728
          3254388
          PERSON
          name of person 1
          1
        
      
    
    Yes
    
      
        
          3254403
          3254388
          PERSON
          name of person 2
          1
        
      
    
    
      
        
          3391122
          3255191
          THING-TYPE-1
          thing type 1 name 1
          1
        
      
    
    
      
        
          3255215
          3255198
          THING-TYPE-2
          thing type 2 name 1
          1
        
        
          3255227
          3255198
          THING-TYPE-2
          thing type 2 name 2
          1
        
      
    
  
  
    3391122
    THING-TYPE-1
    thing type 1 name 1
    thing_000745
  
  
    3255215
    THING-TYPE-2
    thing type 2 name 1
    thing_000783
  
  
    3255227
    THING-TYPE-2
    thing type 2 name 2
    thing_000803

aneroid · Accepted Answer

Question 1: How can only get a record in the output CSV file for the every of type , and not of all of the other types?

Add a check for that tag and its content - if it's not the one you want, then continue to the next iteration. Or process only the one's which match.

Question 2: What do I need to do differently in my Python to get name = i.find("name").text to return "name of Item 1"?

Use Element.findtext() and provide a default value for tags which aren't present for a row.

findtext (match, default=None, namespaces=None )

Finds text for the first subelement matching match. match may be a tag name or a path. Returns the text content of the first matching element, or default if no element was found. Note that if the matching element has no text content an empty string is returned. namespaces is an optional mapping from namespace prefix to full name. Pass '' as prefix to move all unprefixed tag names in the expression into the given namespace.

So provide a default value that you'd like if the tag is missing.

As for the child/nested elements, you need to use XPATHs. So instead of PERSON-TYPE-1-NAME.result.row.name, use PERSON-TYPE-1-NAME/result/row/name.

Putting these together:

for row in root:
    if row.findtext('type') != 'COLLECTION-ITEM':
        # skip if it's not what you're looking for
        continue
    name = row.findtext("name", "Missing name")
    item_id = row.findtext("ITEM-ID", "Missing item_id")
    attrib_1 = row.findtext("ATTRIB-1", "Missing attrib_1")
    p1_name = row.findtext("./PERSON-TYPE-1-NAME/result/row/name", "Missing p1_name")
    attrib_2 = row.findtext("ATTRIB-2", "Missing attrib_2")
    p2_name = row.findtext("./PERSON-TYPE-2-NAME/result/row/name", "Missing p2_name")
    relat_thing1_id = row.findtext("country", "Missing relat_thing1_id")
    relat_thing2_ids = row.findtext("country", "Missing relat_thing2_ids")
    
    rows.append({
        "Collection item": name,
        "ITEM-ID": item_id,
        "ATTRIB-1": attrib_1,
        "PERSON-TYPE-1-NAME": p1_name,
        "ATTRIB-2": attrib_2,
        "PERSON-TYPE-2-NAME": p2_name,
        "RELATED-THING-1 id": relat_thing1_id,
        "RELATED-THING-2 IDs": relat_thing2_ids
    })

print(rows)

Output:

[{'Collection item': 'name of Item 1',
  'ITEM-ID': 'item_000001',
  'ATTRIB-1': 'Yes',
  'PERSON-TYPE-1-NAME': 'name of person 1',
  'ATTRIB-2': 'Yes',
  'PERSON-TYPE-2-NAME': 'name of person 2',
  'RELATED-THING-1 id': 'Missing relat_thing1_id',
  'RELATED-THING-2 IDs': 'Missing relat_thing2_ids'}]

You can put "Missing" as the default value, or an empty string '', I've put explicit names above for clarity.

Are you using pandas just to write to CSV or do you actually plan on using pandas for data wrangling? If it's just to write to CSV, then use csv.DictWriter directly.

Python XML to CSV help for novice

Answers (1)

Related Questions