Reputation: 21
I am currently using SSIS on a project where I need to verify that XML files have the correct structure. In particular, I have to check that no tag is missing from the XML file, and if one is missing, I have to retrieve the line that has no tag. Here is an example to make this clearer.
<?xml version="1.0"?>
<catalog>
<DATA>0000000061E82D821590010000409525CD</DATA>
<DATA>0000000061E82D8C163001000140AD0DF6</DATA>
<DATA>0000000061E82D9616E301000240776CAB</DATA>
<DATA> 0000000061E82DA0178001000340C56B6</DATA>
<DATA>0000000061E82DAA188001000440C0C7CB</DATA>
0000000061E82DDAEA4001000540BB9A276
</catalog>
For example, in the above XML there is a <DATA> tag missing. I have no influence on the creation of the XML.
How can I detect that a <DATA> tag is missing (the number of data lines is not fixed), and then retrieve the line that has no tag?
The solution can be a set of SSIS components or a C# script.
Upvotes: 0
Views: 932
Reputation: 111491
It is impossible to automatically correct invalid XML in the general case.
Terminology correction
For example in the above XML there is a <DATA> tag missing.
There is not a <DATA> tag missing. You probably mean that there are supposed to be begin and end DATA tags surrounding 0000000061E82DDAEA4001000540BB9A276. The difference is significant because if there were only a single tag missing, the "XML" would not be well-formed. As posted, the document is well-formed, because the untagged line is simply text content of the catalog element. If a schema says that a catalog element may only have DATA children, then the XML is not valid.
See Well-formed vs Valid XML for a detailed description of this important distinction.
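To make the distinction concrete, here is a minimal sketch in Python (chosen only for brevity; it is not SSIS or C#). It assumes the rule "catalog may contain only DATA elements and no loose text", which is my reading of the intended structure, not a schema from the question. The function name find_stray_text and its child_tag parameter are hypothetical. The parser accepts the sample because it is well-formed; the untagged line only shows up when the content of catalog is checked against the assumed rule.

```python
import xml.etree.ElementTree as ET

# The sample document from the question, unchanged.
SAMPLE = """<?xml version="1.0"?>
<catalog>
<DATA>0000000061E82D821590010000409525CD</DATA>
<DATA>0000000061E82D8C163001000140AD0DF6</DATA>
<DATA>0000000061E82D9616E301000240776CAB</DATA>
<DATA> 0000000061E82DA0178001000340C56B6</DATA>
<DATA>0000000061E82DAA188001000440C0C7CB</DATA>
0000000061E82DDAEA4001000540BB9A276
</catalog>
"""


def find_stray_text(xml_text, child_tag="DATA"):
    """Return (stray_lines, unexpected_tags) for the root element.

    Assumed rule (not taken from any real schema): the root may contain
    only <child_tag> elements and whitespace.
    """
    # Raises ET.ParseError if the input is not well-formed.
    root = ET.fromstring(xml_text)

    # Text directly inside the root lives in root.text (before the first
    # child) and in each child's .tail (after that child's end tag).
    chunks = [root.text] + [child.tail for child in root]

    stray_lines = []
    for chunk in chunks:
        if chunk and chunk.strip():
            stray_lines.extend(
                line.strip() for line in chunk.splitlines() if line.strip()
            )

    unexpected_tags = [child.tag for child in root if child.tag != child_tag]
    return stray_lines, unexpected_tags


if __name__ == "__main__":
    stray, unexpected = find_stray_text(SAMPLE)
    if stray or unexpected:
        print("Rejecting document; untagged text:", stray,
              "unexpected elements:", unexpected)
```

This only detects and reports the violation so the file can be rejected; it does not attempt a repair. If a real schema exists, validating against it is a more reliable check than a hand-written rule like this one.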
Don't try to automatically correct invalid XML
Best practice is to reject the input and force the sender/creator to fix the document. The entire raison d'être for a schema is to express the invariants that can be relied upon to process the data. Violating those invariants means all bets are off.
Don't be seduced by the superficial simplicity of peep-hole repair ideas
Every repair idea implies an assumption about the data that is not expressed in the schema, which would be bad because:
Upvotes: 1