Thedornman

Reputation: 43

Parsing XML with ElementTree's iter() with no argument does not return the first several tags in the file

I am trying to extract all of the headers (tag names) from an XML file and put them into a list in Python. However, every time I run my code, the first tag extracted is not actually the first tag in the XML file. The list instead begins with the 18th tag and continues from there. The really weird part is that when I originally wrote this code, it worked as expected, but after I added code to extract the element text and put it in a list, the header code stopped working, both in the original program and in the standalone code below. I should also mention that the complete program does not modify the XML file in any way. All manipulation is done exclusively on the Python lists after the extraction.

import xml.etree.ElementTree as ET

tree = ET.parse("Sample.xml")
root = tree.getroot()

headers = [elem.tag for elem in root.iter()]

print(headers)

Sample.xml is a sensitive file, so I had to redact all the element text. It is also a very large file, so I only included one account's worth of elements.

-<ExternalCollection xmlns="namespace.xsd">
    -<Batch>
        <BatchID>***</BatchID>
        <ExternalCollectorName>***</ExternalCollectorName>
        <PrintDate>***</PrintDate>
        <ProviderOrganization>***</ProviderOrganization>
        <ProvOrgID>***</ProvOrgID>
       -<Account>
           <AccountNum>***</AccountNum>
           <Guarantor>***</Guarantor>
           <GuarantorAddress1>***</GuarantorAddress1>
           <GuarantorAddress2/>
           <GuarantorCityStateZip>***</GuarantorCityStateZip>
           <GuarantorEmail/>
           <GuarantorPhone>***</GuarantorPhone>
           <GuarantorMobile/>
           <GuarantorDOB>***</GuarantorDOB>    
           <AccountID>***</AccountID>
           <GuarantorID>***</GuarantorID>
          -<Incident>
               <Patient>***</Patient>
               <PatientDOB>***</PatientDOB>
               <FacilityName>***</FacilityName>
              -<ServiceLine>
                  <DOS>***</DOS>
                  <Provider>***</Provider>
                  <Code>***</Code>
                  <Modifier>***</Modifier>
                  <Description>***</Description>
                  <Billed>***</Billed>
                  <Expected>***</Expected>
                  <Balance>***</Balance>
                  <SelfPay>***</SelfPay>
                  <IncidentID>***</IncidentID>
                  <ServiceLineID>***</ServiceLineID>
                 -<OtherActivity>  
                  </OtherActivity>
              </ServiceLine>
          </Incident>
      </Account>
  </Batch>
  </ExternalCollection>

The output is as follows:

 'namespace.xsd}PatientDOB', '{namespace.xsd}FacilityName', '{namespace.xsd}ServiceLine', '{namespace.xsd}DOS', '{namespace.xsd}Provider', '{namespace.xsd}Code', '{namespace.xsd}Modifier', '{namespace.xsd}Description', '{namespace.xsd}Billed', '{namespace.xsd}Expected', '{namespace.xsd}Balance', '{namespace.xsd}SelfPay', '{namespace.xsd}IncidentID', '{namespace.xsd}ServiceLineID', '{namespace.xsd}OtherActivity'

As you can see, for some reason the first returned value is PatientDOB instead of the actual first tag.

Thank y'all in advance!

Upvotes: 1

Views: 820

Answers (1)

Valdi_Bo

Reputation: 31011

Your input file should not contain "-" chars in front of XML tags. You should drop at least the first "-", in front of the root tag, otherwise a parsing error occurs.
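To illustrate the point about the "-" characters, here is a minimal sketch. The `browser_copy` string is a shortened, hypothetical stand-in for the pasted sample, and the regular expression is just one possible way to strip the markers, assuming they always sit directly in front of a "<" at the start of a line.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, shortened copy of the sample, including the "-" markers
# that a browser's XML tree view puts in front of collapsible elements.
browser_copy = """-<ExternalCollection xmlns="namespace.xsd">
    -<Batch>
        <BatchID>***</BatchID>
    </Batch>
</ExternalCollection>"""

# Text in front of the root element is not well-formed XML.
try:
    ET.fromstring(browser_copy)
    parsed_as_is = True
except ET.ParseError:
    parsed_as_is = False

# Remove a "-" that appears at the start of a line (after optional
# indentation) and directly in front of "<".
cleaned = re.sub(r"^(\s*)-(?=<)", r"\1", browser_copy, flags=re.MULTILINE)

root = ET.fromstring(cleaned)
tags = [elem.tag for elem in root.iter()]
print(parsed_as_is)
print(tags)
```

If the real Sample.xml on disk has no such markers (they may only come from copying out of a browser view), this cleanup step is not needed at all.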

Note also that the first tag name in your printed output has no leading "{", so something is apparently happening to your list, presumably after the loop.

I ran your code and got a proper list, containing all tags.

Try the following loop:

for elem in root.iter():
    print(elem.tag)

Maybe it will give you some clue about the real cause of your error.
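A slightly extended variant of that loop, as a sketch: it prints one tag per line with its index, then the length and first entry of the list, so you can tell whether the list itself is incomplete or only the printed output looks cut off. The inline `sample` string is a hypothetical, shortened stand-in for your file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for Sample.xml.
sample = """<ExternalCollection xmlns="namespace.xsd">
  <Batch>
    <BatchID>***</BatchID>
    <Account>
      <AccountNum>***</AccountNum>
    </Account>
  </Batch>
</ExternalCollection>"""

root = ET.fromstring(sample)

# iter() with no argument walks the whole tree in document order,
# starting with the root element itself.
headers = [elem.tag for elem in root.iter()]

# One tag per line, with its position in the list.
for index, tag in enumerate(headers):
    print(index, tag)

# Inspect the list directly instead of relying on one long printed line.
print("count:", len(headers))
print("first:", headers[0])
```

If `headers[0]` is the root tag and the count matches what you expect, the list is fine and the problem lies in how the output is displayed or in code that runs later.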

Consider also upgrading your Python installation. Maybe you have some outdated modules.

Yet another hint: run your code on just the input that you included in your post, with the content replaced by "***". The real cause of your error may lie in the actual content of some source element (which you replaced here with asterisks).
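One way to make that comparison, sketched under assumptions: `redact` is a hypothetical helper (not part of ElementTree) that copies the tree and replaces every non-blank element text with "***", so the tag list of the redacted copy can be compared with that of the original. The `original` string and its values are made up for illustration.

```python
import copy
import xml.etree.ElementTree as ET


def redact(root):
    """Return a deep copy of root with every non-blank element text replaced by '***'."""
    clone = copy.deepcopy(root)
    for elem in clone.iter():
        if elem.text and elem.text.strip():
            elem.text = "***"
    return clone


# Stand-in for the real file; the values here are made up.
original = ET.fromstring(
    '<ExternalCollection xmlns="namespace.xsd">'
    "<Batch><BatchID>12345</BatchID><GuarantorEmail/></Batch>"
    "</ExternalCollection>"
)
redacted = redact(original)

# Redaction only touches text, so both trees should yield the same tags.
original_tags = [elem.tag for elem in original.iter()]
redacted_tags = [elem.tag for elem in redacted.iter()]
print(original_tags == redacted_tags)
```

If the two lists match for your real file but the program still misbehaves on the unredacted data, that points toward the content rather than the structure.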

Upvotes: 1
