Dexter
Dexter

Reputation: 21

Automatically correct invalid XML?

I am currently using SSIS on a project where I need to verify the correct XML file structure. In particular, I have to check that there is no missing tag in the XML file and if so, I have to reassemble this line without tag. I'll give you an example to better understand.

<?xml version="1.0"?>
<catalog>
<DATA>0000000061E82D821590010000409525CD</DATA> 
<DATA>0000000061E82D8C163001000140AD0DF6</DATA> 
<DATA>0000000061E82D9616E301000240776CAB</DATA>
<DATA> 0000000061E82DA0178001000340C56B6</DATA> 
<DATA>0000000061E82DAA188001000440C0C7CB</DATA>
 0000000061E82DDAEA4001000540BB9A276
</catalog>

For example in the above XML there is a <DATA> tag missing. I have no influence on the creation of the XML. How could I notice that a <DATA> tag is missing (the number of data lines is not fixed), and subsequently retrieve that line where there is no tag ?

For example in the above xml there is a <DATA> tag missing. I have no influence on the creation of the XML.

The solution can be a suite of SSIS components or a c# script.

Upvotes: 0

Views: 932

Answers (1)

kjhughes
kjhughes

Reputation: 111491

It is impossible to automatically correct invalid XML in the general case.

Terminology correction

For example in the above XML there is a <DATA> tag missing.

There is not a <DATA> tag missing. You probably mean that there are supposed to be begin and end DATA tags surrounding 0000000061E82DDAEA4001000540BB9A276. The difference is significant because if there were only a single tag missing, the "XML" would not be well-formed. If a schema says that a catalog element may only have DATA children, then the XML is not valid.

Don't try to automatically correct invalid XML

Best practice is to reject the input and force the sender/creator to fix the document. The entire raison d'être for a schema is to express the invariants that can be relied upon to process the data. Violating those invariants means all bets are off.

Don't be seduced by the superficial simplicity of peep-hole repair ideas

Every repair idea implies an assumption about the data that is not expressed in the schema, which would be bad because:

  • There should be a clearly and explicitly expressed definition of validity and
  • The assumptions
    • will likely not be expressed unambiguously.
    • may not be expressed at all.
    • may be incomplete or entirely incorrect.
    • will probably go unconfirmed because an errant producer that can/will not fix validity against a schema is unlikely to be able to assess the validity of an assumption over all data that it is, or could be, sending over all time.

See also

Upvotes: 1

Related Questions