Reputation: 94
**** See the EDIT section below.
Thanks for looking into this issue. I am not sure whether this is the right forum for this thread; if not, please let me know where to post it.
We have a complex XML message (data in XML format). We are looking for a way to extract all the XPaths of this message together with its element- and attribute-level data content. We tried XMLSpy and xmltwig, but no luck. xml_grep pulls data if we give it an XPath as input, but it has no option to list all the XPaths of an XML message.
I have a well-formed XML message. I want to produce a list/report of:
All XPaths of the XML message (browse the message and list every XPath)
Each XPath with its data content (browse the message and list every XPath together with its data)
Here is an example (input XML message):
<?xml version="1.0"?>
<PARTS>
<TITLE>Computer Parts</TITLE>
<PART>
<ITEM>Motherboard</ITEM>
<MANUFACTURER>ASUS</MANUFACTURER>
<MODEL>P3B-F</MODEL>
<COST> 123.00</COST>
</PART>
<PART>
<ITEM>Video Card</ITEM>
<MANUFACTURER>ATI</MANUFACTURER>
<MODEL>All-in-Wonder Pro</MODEL>
<COST> 160.00</COST>
</PART>
<PART>
<ITEM>Sound Card</ITEM>
<MANUFACTURER>Creative Labs</MANUFACTURER>
<MODEL>Sound Blaster Live</MODEL>
<COST> 80.00</COST>
</PART>
<PART>
<ITEM>inch Monitor</ITEM>
<MANUFACTURER>LG Electronics</MANUFACTURER>
<MODEL> 995E</MODEL>
<COST> 290.00</COST>
</PART>
</PARTS>
The desired output (I created the following list manually):
/PARTS/TITLE Computer Parts
/PARTS/PART[1]/ITEM Motherboard
/PARTS/PART[1]/MANUFACTURER ASUS
/PARTS/PART[1]/MODEL P3B-F
/PARTS/PART[1]/COST 123.00
/PARTS/PART[2]/ITEM Video Card
/PARTS/PART[2]/MANUFACTURER ATI
............
..............
..................
...................
Is there any open source product that produces such a report for an XML message?
What are the ways to extract the XPaths, or the XPaths together with their data content?
Thanks for letting me pick the brains of this forum.
+++++
Thanks. The posted code produces this output:
Field|Value
/*|
/*/*[1]|X
/*/*[2]|000000000
/*/*[3]|000000000
/*/*[4]|&
/*/*[5]|
I am not able to get the XPaths as text (with element names).
Here is the input XML:
<CorrectedW2Ind>X</CorrectedW2Ind>
<EmployeeSSN>000000000</EmployeeSSN>
<EmployerEIN>000000000</EmployerEIN>
<EmployerNameControlTxt>&</EmployerNameControlTxt>
<EmployerName>
<BusinessNameLine1Txt>#</BusinessNameLine1Txt>
<BusinessNameLine2Txt>#</BusinessNameLine2Txt>
</EmployerName>
<EmployerUSAddress>
<AddressLine1Txt>0</AddressLine1Txt>
<AddressLine2Txt>0</AddressLine2Txt>
<CityNm>A</CityNm>
<StateAbbreviationCd>PW</StateAbbreviationCd>
<ZIPCd>00000</ZIPCd>
</EmployerUSAddress>
<EmployersUseGrp>
<EmployersUseCd>A</EmployersUseCd>
<PriorUSERRAContributionYr>00</PriorUSERRAContributionYr>
<EmployersUseAmt>0</EmployersUseAmt>
</EmployersUseGrp>
<EmployersUseGrp>
<EmployersUseCd>A</EmployersUseCd>
<PriorUSERRAContributionYr>00</PriorUSERRAContributionYr>
<EmployersUseAmt>0</EmployersUseAmt>
</EmployersUseGrp>
<EmployersUseGrp>
<EmployersUseCd>A</EmployersUseCd>
<PriorUSERRAContributionYr>00</PriorUSERRAContributionYr>
<EmployersUseAmt>0</EmployersUseAmt>
</EmployersUseGrp>
<EmployersUseGrp>
<EmployersUseCd>A</EmployersUseCd>
<PriorUSERRAContributionYr>00</PriorUSERRAContributionYr>
<EmployersUseAmt>0</EmployersUseAmt>
</EmployersUseGrp>
<EmployersUseGrp>
<EmployersUseCd>A</EmployersUseCd>
<PriorUSERRAContributionYr>00</PriorUSERRAContributionYr>
<EmployersUseAmt>0</EmployersUseAmt>
</EmployersUseGrp>
a) Which lxml method should I use to get the value and the XPath (as text) with the above code?
b) Which lxml method should I use to get an aggregation (count) of repeating group nodes?
For example: XPath of EmployersUseGrp ====> 5
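For (b), one possible approach is to build an index-free path for every element and tally how often each path occurs. The sketch below is only an illustration: it uses the standard-library xml.etree.ElementTree instead of lxml so that it runs without extra packages, the W2 wrapper and the three-group sample are a trimmed, hypothetical stand-in for the real message, and local_name and count_paths are made-up helper names.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Drop a '{namespace}' prefix from a tag name, if there is one."""
    return tag.rsplit("}", 1)[-1]


def count_paths(elem, path, counts):
    """Tally how many elements share each index-free path."""
    counts[path] += 1
    for child in elem:
        count_paths(child, f"{path}/{local_name(child.tag)}", counts)
    return counts


# Hypothetical, trimmed sample with three repeating groups.
sample = """<W2>
  <EmployerEIN>000000000</EmployerEIN>
  <EmployersUseGrp><EmployersUseCd>A</EmployersUseCd></EmployersUseGrp>
  <EmployersUseGrp><EmployersUseCd>A</EmployersUseCd></EmployersUseGrp>
  <EmployersUseGrp><EmployersUseCd>A</EmployersUseCd></EmployersUseGrp>
</W2>"""

root = ET.fromstring(sample)
counts = count_paths(root, "/" + local_name(root.tag), Counter())

# Report only the paths that occur more than once.
for path, n in counts.items():
    if n > 1:
        print(f"{path} ====> {n}")
```

Note that children of a repeating group are counted as well, so /W2/EmployersUseGrp/EmployersUseCd is also reported here. The same recursion should carry over to lxml elements, but that is an assumption that has not been checked against the real file.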
EDIT ===== 6/26/2019 ========================
I am not able to open new questions; I am getting a "question limit exceeded" message. So I am posting the follow-up to this code here.
I am trying to use the posted Python code answer, but I am getting weird output.
I have a large XML file (inputf.xml), shown below. I used it as the input in the posted code (input = 'inputf.xml').
<?xml version="1.0" encoding="UTF-8"?>
<DataFileFor>
<DataR>
<Id>5070022019330a0050hq</Id>
<NUM>30221730001019</NUM>
<Postmark>2020-01-03T09:25:57.000-05:00</Postmark>
<TNO>47647</TNO>
.
.
.
.
.
</DataFileFor>
++++
When I grab the node by its XPath using xml_grep, I get the following.
xml_grep DataFileFor/DataR/Ret/W2 inputf.xml ===> output
<?xml version="1.0" ?>
<xml_grep version="0.7" date="Fri Jun 26 13:07:11 2020">
<file filename="inputf.xml">
<W2 Id="W2" dName="W2" sId="00000000" sVersionNum="String">
<CorrectedW2Ind>X</CorrectedW2Ind>
<EmployeeSSN>000000000</EmployeeSSN>
<EmployerEIN>000000000</EmployerEIN>
<EmployerNameControlTxt>S</EmployerNameControlTxt>
<EmployerName>
<BusinessNameLine1Txt>String</BusinessNameLine1Txt>
<BusinessNameLine2Txt>String</BusinessNameLine2Txt>
</EmployerName>
<EmployerUSAddress>
<AddressLine1Txt>String</AddressLine1Txt>
<AddressLine2Txt>String</AddressLine2Txt>
<CityNm>String</CityNm>
<StateAbbreviationCd>AL</StateAbbreviationCd>
<ZIPCd>000000000</ZIPCd>
.
.
.
.
.
</W2>
When I use this code, it does not produce readable XPaths. The output XPaths look like this:
/DataFileFor/DataR/*[8]/*[2]/*[6]/*[3]/*[10]|X
/DataFileFor/DataR/*[8]/*[2]/*[6]/*[3]/*[11]|00000000
/DataFileFor/DataR/*[8]/*[2]/*[6]/*[3]/*[12]|00000000
/DataFileFor/DataR/*[8]/*[2]/*[6]/*[3]/*[13]|S
/DataFileFor/DataR/*[8]/*[2]/*[6]/*[3]/*[14]|String
The attributes
Id="W2" dName="W2" sId="00000000" sVersionNum="String" are not showing up in the output.
What changes to the code are required to fix this?
Thanks for your guidance.
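A possible direction for both problems, as a hedged sketch rather than a drop-in fix: paths such as /*/*[1] usually appear when the elements live in a default namespace, so one option is to build the path by hand from the local element names and to emit one extra row per attribute. The example uses the standard-library xml.etree.ElementTree so it runs without lxml; the urn:example namespace, the sample document, and the helper names local_name, walk, and xpath_rows are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Drop a '{namespace}' prefix from a tag or attribute name."""
    return tag.rsplit("}", 1)[-1]


def walk(elem, path):
    """Yield (path, value) rows for elem, its attributes and its descendants."""
    yield path, (elem.text or "").strip()
    for name, value in elem.attrib.items():
        yield f"{path}/@{local_name(name)}", value
    # Count same-named children so only repeated names get a [n] index.
    totals = Counter(local_name(child.tag) for child in elem)
    seen = Counter()
    for child in elem:
        name = local_name(child.tag)
        seen[name] += 1
        step = f"{name}[{seen[name]}]" if totals[name] > 1 else name
        yield from walk(child, f"{path}/{step}")


def xpath_rows(xml_text):
    """Return all (path, value) rows for an XML string."""
    root = ET.fromstring(xml_text)
    return list(walk(root, "/" + local_name(root.tag)))


# Hypothetical sample with a default namespace and one attribute.
sample = """<Ret xmlns="urn:example">
  <W2 Id="W2">
    <CorrectedW2Ind>X</CorrectedW2Ind>
    <EmployersUseGrp><EmployersUseCd>A</EmployersUseCd></EmployersUseGrp>
    <EmployersUseGrp><EmployersUseCd>B</EmployersUseCd></EmployersUseGrp>
  </W2>
</Ret>"""

for path, value in xpath_rows(sample):
    print(f"{path}|{value}")
```

Container elements such as /Ret/W2 show up with an empty value; filter them out if only leaf data is wanted. Because the namespace is stripped, the paths are readable labels rather than expressions that can be evaluated directly against a namespaced document. In lxml, etree.QName(element).localname gives the same local name, so a similar walk could replace the tree.getpath(e) call.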
Upvotes: 0
Views: 615
Reputation: 61
Just saw this. I wrote something in Python that does this; it outputs to CSV, pipe-delimited. Feel free to use it. Happy to answer any questions, but don't expect an immediate response.
from lxml import etree


def parseXML(xmlFile, outputFile):
    """Write one 'path|text' row per element of xmlFile to outputFile."""
    # Let lxml read the file itself: etree.fromstring() on a str can raise
    # ValueError when the XML starts with an encoding declaration.
    tree = etree.parse(xmlFile)
    root = tree.getroot()
    with open(outputFile, 'w') as f:  # closed automatically on exit
        f.write("%s|%s\n" % ("Field", "Value"))
        for e in root.iter():
            f.write("%s|%s\n" % (tree.getpath(e), e.text))


if __name__ == "__main__":
    print('Loading variables...')
    input_file = '16a.xml'  # renamed so the builtin input() is not shadowed
    output_file = input_file + '.csv'
    parseXML(input_file, output_file)
Upvotes: 1