Reputation: 49
I have a large-ish XML file. I include a scrubbed snippet of it below. The result-size
below is the number of elements in the XML file that the file calls <row>
.
Question 1: How can only get a record in the output CSV file for the every <row>
of type <COLLECTION-ITEM>
, and not of all of the other types? I have no control over how the XML is structured. If you comment out all but the test values in my Python code, you'll see that file is outputting 5 records when I only want 1.
Question 2: What do I need to do differently in my Python to get name = i.find("name").text
to return "name of Item 1"?
I think it's fair for the answer to just focus on those first two questions. I hope that getting those answers puts me on a path to solving the rest of my problem with this XML to CSV issue. Here is the big picture and a couple more issues I will need to solve, though. Point me to books, classes, whatever will help. I only have a week or so to make this work.
Desired output:
Collection item,ITEM-ID,ATTRIB-1,PERSON-TYPE-1-NAME,ATTRIB-2,PERSON-TYPE-2-NAME,RELATED-THING-1 id,RELATED-THING-2 IDs
name of Item 1,item_000001,Yes,name of person 1,Yes,name of person 2,thing_000745,"thing_000783, thing_000803"
I can read in the file and output a CSV file with my designated column names. But I can't get the name of even the first thing into the CSV.
Then there are the more complicated bits that will require functions, such as finding an ID attribute based on matching the type and name attributes, and returning more than one ID. See RELATED-THING-1-ID and RELATED-THING-2-IDs.
Here is my Python (based on https://www.geeksforgeeks.org/convert-xml-to-csv-in-python/):
# -*- coding: utf-8 -*-
# Importing the required libraries
import xml.etree.ElementTree as Xet
import pandas as pd
cols = ["Collection item", "ITEM-ID", "ATTRIB-1", "PERSON-TYPE-1-NAME" ,
"ATTRIB-2", "PERSON-TYPE-2-NAME", "RELATED-THING-1 id",
"RELATED-THING-2 IDs"]
rows = []
# Parsing the XML file
xmlparse = Xet.parse('sample.xml')
root = xmlparse.getroot()
for i in root:
name = i.find("name").text
item_id = i.find("ITEM-ID").text
attrib_1 = i.find("ATTRIB-1").text
p1_name = i.find("PERSON-TYPE-1-NAME.result.row.name").text
attrib_2 = i.find("ATTRIB-2").text
p2_name = i.find("PERSON-TYPE-2-NAME.result.row.name").text
relat_thing1_id = i.find("country").text
relat_thing2_ids = i.find("country").text
rows.append({"Collection item": name,
"ITEM-ID": item_id,
"ATTRIB-1": attrib_1,
"PERSON-TYPE-1-NAME": p1_name,
"ATTRIB-2": attrib_2,
"PERSON-TYPE-2-NAME": p2_name,
"RELATED-THING-1 id": relat_thing1_id,
"RELATED-THING-2 IDs": relat_thing2_ids
})
df = pd.DataFrame(rows, columns=cols)
# Writing dataframe to csv
df.to_csv('output.csv')
This is the XML:
<?xml version="1.0" encoding="UTF-8"?>
<result size="4321">
<row>
<id>3255183</id>
<type>CONTEXT</type>
<name>collections</name>
</row>
<row>
<id>3652889</id>
<type>COLLECTION-ITEM</type>
<name>name of Item 1</name>
<ITEM-ID>item_000001</ITEM-ID>
<ATTRIB-1>Yes</ATTRIB-1>
<PERSON-TYPE-1-NAME>
<result size="1">
<row>
<id>3254728</id>
<scopeId>3254388</scopeId>
<type>PERSON</type>
<name>name of person 1</name>
<no>1</no>
</row>
</result>
</PERSON-TYPE-1-NAME>
<ATTRIB-2>Yes</ATTRIB-2>
<PERSON-TYPE-2-NAME>
<result size="1">
<row>
<id>3254403</id>
<scopeId>3254388</scopeId>
<type>PERSON</type>
<name>name of person 2</name>
<no>1</no>
</row>
</result>
</PERSON-TYPE-2-NAME>
<RELATED-THING-1>
<result size="1">
<row>
<id>3391122</id>
<scopeId>3255191</scopeId>
<type>THING-TYPE-1</type>
<name>thing type 1 name 1</name>
<no>1</no>
</row>
</result>
</RELATED-THING-1>
<RELATED-THING-2>
<result size="2">
<row>
<id>3255215</id>
<scopeId>3255198</scopeId>
<type>THING-TYPE-2</type>
<name>thing type 2 name 1</name>
<no>1</no>
</row>
<row>
<id>3255227</id>
<scopeId>3255198</scopeId>
<type>THING-TYPE-2</type>
<name>thing type 2 name 2</name>
<no>1</no>
</row>
</result>
</RELATED-THING-2>
</row>
<row>
<id>3391122</id>
<type>THING-TYPE-1</type>
<name>thing type 1 name 1</name>
<THING-ID>thing_000745</THING-ID>
</row>
<row>
<id>3255215</id>
<type>THING-TYPE-2</type>
<name>thing type 2 name 1</name>
<THING-ID>thing_000783</THING-ID>
</row>
<row>
<id>3255227</id>
<type>THING-TYPE-2</type>
<name>thing type 2 name 2</name>
<THING-ID>thing_000803</THING-ID>
</row>
</result>
Upvotes: 1
Views: 158
Reputation: 15962
Question 1: How can only get a record in the output CSV file for the every <row>
of type <COLLECTION-ITEM>
, and not of all of the other types?
Add a check for that tag and its content - if it's not the one you want, then continue
to the next iteration. Or process only the one's which match.
Question 2: What do I need to do differently in my Python to get name = i.find("name").text
to return "name of Item 1"?
Use Element.findtext()
and provide a default value for tags which aren't present for a row.
findtext
(match, default=None, namespaces=None )Finds text for the first subelement matching match. match may be a tag name or a path. Returns the text content of the first matching element, or default if no element was found. Note that if the matching element has no text content an empty string is returned. namespaces is an optional mapping from namespace prefix to full name. Pass
''
as prefix to move all unprefixed tag names in the expression into the given namespace.
So provide a default value that you'd like if the tag is missing.
As for the child/nested elements, you need to use XPATHs. So instead of PERSON-TYPE-1-NAME.result.row.name
, use PERSON-TYPE-1-NAME/result/row/name
.
Putting these together:
for row in root:
if row.findtext('type') != 'COLLECTION-ITEM':
# skip if it's not what you're looking for
continue
name = row.findtext("name", "Missing name")
item_id = row.findtext("ITEM-ID", "Missing item_id")
attrib_1 = row.findtext("ATTRIB-1", "Missing attrib_1")
p1_name = row.findtext("./PERSON-TYPE-1-NAME/result/row/name", "Missing p1_name")
attrib_2 = row.findtext("ATTRIB-2", "Missing attrib_2")
p2_name = row.findtext("./PERSON-TYPE-2-NAME/result/row/name", "Missing p2_name")
relat_thing1_id = row.findtext("country", "Missing relat_thing1_id")
relat_thing2_ids = row.findtext("country", "Missing relat_thing2_ids")
rows.append({
"Collection item": name,
"ITEM-ID": item_id,
"ATTRIB-1": attrib_1,
"PERSON-TYPE-1-NAME": p1_name,
"ATTRIB-2": attrib_2,
"PERSON-TYPE-2-NAME": p2_name,
"RELATED-THING-1 id": relat_thing1_id,
"RELATED-THING-2 IDs": relat_thing2_ids
})
print(rows)
Output:
[{'Collection item': 'name of Item 1',
'ITEM-ID': 'item_000001',
'ATTRIB-1': 'Yes',
'PERSON-TYPE-1-NAME': 'name of person 1',
'ATTRIB-2': 'Yes',
'PERSON-TYPE-2-NAME': 'name of person 2',
'RELATED-THING-1 id': 'Missing relat_thing1_id',
'RELATED-THING-2 IDs': 'Missing relat_thing2_ids'}]
You can put "Missing" as the default value, or an empty string ''
, I've put explicit names above for clarity.
Are you using pandas just to write to CSV or do you actually plan on using pandas for data wrangling? If it's just to write to CSV, then use csv.DictWriter
directly.
Upvotes: 1