Simon Kiely

Reputation: 6050

How can I validate my 3,000,000-line XML file?

I have an XML file. It is nearly correct, but not quite.

Error on line 302211.
Extra Content at the end of the document.

I've spent two days trying to debug this, but the file is so big that it's nearly impossible. Is there anything I can do?

Here are the relevant lines as well (I've included the two lines before the reported error; the error begins at the <seg> tag).

 <tu>
   <tuv xml:lang="en"> 
    <prop type="feed"></prop>
    <seg>
        <bpt i="1" x="1" type="feed">
            test
        </bpt>
        To switch on computer:
        <ept i="1">
            &gt;
        </ept>
        Press device 
        <ph x="2" type="feed">
            &lt;schar _TR=&quot;123&quot; y.io.name
        </ph> or 
        <ph x="3" type="feed">
            &lt;schar _TR=&quot;274&quot; y.io.name=&quot;
        </ph> (Spain) twice. 
    </seg>
 </tuv>
</tu>

Can anyone give me some pointers on finding the issue here? I am using the Notepad++ XML plugin.

Upvotes: 0

Views: 544

Answers (1)

kjhughes

Reputation: 111686

Background notes

  • The XML fragment you've posted stands on its own as a well-formed XML document – the problem must be somewhere else in your XML.
  • Your particular XML problem is well-formedness, not validity.
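
As a rough illustration of the distinction, here is a minimal Python sketch that checks well-formedness only, with no DTD or schema involved. It assumes Python's standard `xml.etree.ElementTree` is available; the helper name `well_formedness_error` and the two sample strings are made up for this example. A second top-level element after the root has closed is one typical cause of an "extra content at the end of the document" style of error, although Python's expat-based parser words that condition differently.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text):
    """Return None if xml_text is well-formed, else the parser's message."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as exc:
        return str(exc)


# Hypothetical samples: one document with a single root element,
# and the same document followed by a second top-level element.
one_root = "<tu><tuv xml:lang='en'><seg>text</seg></tuv></tu>"
two_roots = one_root + "<tu/>"

print(well_formedness_error(one_root))   # well-formed, so no message
print(well_formedness_error(two_roots))  # parser complains about trailing content
```

A document can pass this check and still be invalid against a schema; validity is a separate, stricter question.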

Tips for finding XML well-formedness problems

  1. Use an XML parser with better diagnostic messages. Xerces-based tools have very good messages (albeit with a few exceptions).
  2. Know the common problems that cause an XML document not to be well-formed, such as mismatched or unclosed tags, more than one root element, and unescaped & or < characters.
  3. Divide and conquer. Consider this sketch of a huge XML document:

    <root>
       <First>
           <FirstChild>
              <!-- Tons of descendant markup -->
           </FirstChild>
           <SecondChild>
              <!-- Tons of descendant markup -->
           </SecondChild>
       </First>
       <Second>
           <!-- Tons of descendant markup -->
       </Second>
    </root>
    

    Process of elimination:

    1. Delete the First element.
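    2. Recheck the document for well-formedness.
    3. If the error goes away, the problem is inside First: restore the First element, remove the Second element, and then remove the FirstChild element.
    4. Else, the problem is outside First: leave it deleted and continue with the children of the Second element.
    5. Repeat until the error can be more easily spotted in the reduced XML document.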
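
Before cutting a 3,000,000-line file by hand, it may help to let a streaming parser narrow the search. The following is only a sketch, assuming Python's standard `xml.parsers.expat` module; the function name `diagnose`, the dictionary keys, and the sample document are all hypothetical. Besides the error position, it reports which elements were still open and the line on which the root element was closed. For an "extra content" error, that second line is often the more interesting one, because it points at the closing tag that ended the document too early.

```python
import io
import xml.parsers.expat


def diagnose(stream, chunk_size=1 << 20):
    """Stream-parse a binary file object with expat.

    Returns None if the document is well-formed; otherwise a dict with the
    error position, the parser's message, the elements still open at that
    point, and the line on which the root element was closed (if it was).
    """
    parser = xml.parsers.expat.ParserCreate()
    open_elements = []                    # names of currently open elements
    state = {"root_closed_line": None}

    def on_start(name, attrs):
        open_elements.append(name)

    def on_end(name):
        open_elements.pop()
        if not open_elements:             # the root element has just been closed
            state["root_closed_line"] = parser.CurrentLineNumber

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end

    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            parser.Parse(chunk, False)    # feed the file piece by piece
        parser.Parse(b"", True)           # signal end of input
    except xml.parsers.expat.ExpatError as exc:
        return {
            "line": exc.lineno,
            "column": exc.offset,
            "message": xml.parsers.expat.ErrorString(exc.code),
            "open_elements": list(open_elements),
            "root_closed_line": state["root_closed_line"],
        }
    return None


# Hypothetical sample: the root closes on line 3, then more markup follows.
sample = io.BytesIO(b"<root>\n  <First>text</First>\n</root>\n<Second/>\n")
report = diagnose(sample)
print(report)
```

For a real file, open it in binary mode and pass the file object, for example `with open("big.xml", "rb") as f: print(diagnose(f))`, where `big.xml` stands in for your own path. Reading in chunks keeps memory use flat regardless of file size.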


Upvotes: 3
