Pre-process an unformed XML (Java)

Question

The XML file I am working with is not well-formed and therefore invalid. It presents the following issues:

Multiple XML declarations (error message: The processing instruction target matching "[xX][mM][lL]" is not allowed.)
Absence of a root element (error message: Extra content at the end of the document)

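The file includes multiple records; this is an excerpt with two records:

[XML excerpt not preserved: two records, each with its own XML declaration and no common root element]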

In order to be well-formed and valid, the above document should be turned into this (please correct me if I am wrong):

Although I am aware that in the best of all possible worlds the data should be of high quality, unfortunately I have to deal with a poor dataset, and I am trying to find a good approach to produce well-formed and valid XML. At the moment, I have written two utility methods that remove all XML declarations (using Pattern/Matcher for regex) and inject the single required one at the top of the file, and I am about to do something similar to remove any extra root node elements and only keep

I do not think this approach is ideal, and I fear it will be problematic when dealing with large files. Can you help? Any recommendation, suggestion, or potential approach would be much appreciated! I am really looking for a good approach for the scenario described.

Thank you so much,

I.

EDIT 1: As mentioned, the XML content is inside a .txt file, and the two utility methods I wrote use a plain BufferedReader to read its content. I am trying to do all the "data cleaning" before renaming the file with an .xml extension (I have another utility that does that) and feeding it into a JAXB parser.

EDIT 2: Unfortunately, I have no control over the XML generation, as I read the files directly from an FTP server. It would be good to have control over how multiple XML documents get concatenated into the resulting file, for which I have provided the excerpt, but that is not possible.

Michael Kay · Accepted Answer

Basically, your task is to write a parser for a grammar that has some similarities to the grammar for XML. Before you can write a parser for any grammar, you need to define what that grammar is: that is, specify what input your tool will accept, perhaps in terms of variations from the grammar of XML.

Of course, this will be expensive: the purpose of standardisation is to reduce costs so that everyone can use the same grammar and the same parsers, and if people use proprietary variations then life gets a lot more complicated for everyone.

So far, you're asking us to guess the grammar of your deviant XML by showing us a single example. Well, an example doesn't make a specification. More seriously, writing a parser for a language that hasn't been specified by continually extending it to handle more and more examples is not going to work: Sisyphus will finish his task before you do.

You should also bear in mind that the better you are at picking up other people's garbage, the more garbage they will throw at you.

Addendum

If in fact it is the case that your input file contains a sequence of well-formed XML documents concatenated into a single file, then the grammar of your input can actually be specified fairly easily. It's just one extra rule added to the XML specification:

file ::= document+

Perhaps with the modification that the XML declaration at the start of a document is mandatory.
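
To make that rule concrete, here is a minimal Java sketch of a check for it. It assumes the input has already been cut into candidate chunks (how to do that is a separate problem) and uses the JDK's standard DocumentBuilder to test each chunk for well-formedness. The class and method names are made up for illustration, and no hardening against hostile input (for example external entities) is attempted.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: checks "file ::= document+" with a mandatory XML declaration.
final class DocumentSequenceCheck {

    static boolean matchesFileRule(List<String> chunks) {
        if (chunks.isEmpty()) {
            return false; // document+ needs at least one document
        }
        DocumentBuilder builder;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            builder = factory.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable XML parser", e);
        }
        // DefaultHandler rethrows fatal errors and keeps the parser quiet on stderr.
        builder.setErrorHandler(new DefaultHandler());
        for (String chunk : chunks) {
            if (!startsWithDeclaration(chunk)) {
                return false; // the "declaration is mandatory" modification
            }
            try {
                builder.parse(new InputSource(new StringReader(chunk)));
            } catch (SAXException | IOException e) {
                return false; // this chunk is not a well-formed document
            }
        }
        return true;
    }

    // "<?xml" followed by whitespace, so "<?xml-stylesheet" does not count.
    private static boolean startsWithDeclaration(String chunk) {
        return chunk.startsWith("<?xml")
                && chunk.length() > 5
                && Character.isWhitespace(chunk.charAt(5));
    }
}
```

A chunk that still contains two concatenated documents fails this check, because the second declaration triggers the same parser error quoted in the question.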

So defining the grammar you want to accept may not be too difficult. But writing a parser that accurately accepts this grammar is still a challenge. The cleanest way to do it is probably to take an open-source XML parser and modify it.

There's no way of parsing this grammar with regular expressions: it is not a regular language (if you don't understand what this means, you shouldn't be writing parsers, but essentially it means that the definition of the grammar is recursive).

There are, however, some tricks you could use. Every document starts with `<?xml`, and the only places `<?xml` can occur are (a) at the start of a document, (b) in a comment, and (c) in a CDATA section. Comments and CDATA sections cannot be nested, so I think it's the case that every instance of your language will conform to the simpler grammar:

("<?xml" (stuff | comment | cdata)*)*

where stuff is defined as anything that doesn't contain `<?xml`, `<!--`, or `<![CDATA[`.
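
The scanning trick can be sketched in Java roughly as follows. This is an illustration under stated assumptions, not a vetted implementation: the class name ConcatenatedXmlSplitter and its split method are hypothetical, the whole input is assumed to fit in memory as a String (a large file would need a streaming variant), and every document is assumed to begin with an XML declaration. One detail goes beyond the answer: the sketch requires whitespace after `<?xml`, so that a processing instruction such as `<?xml-stylesheet ...?>` is not mistaken for a declaration. It does not try to handle `<?xml` appearing inside the data of another processing instruction or inside a DTD.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: cuts a file of concatenated XML documents at each
// "<?xml" that lies outside comments and CDATA sections.
final class ConcatenatedXmlSplitter {
    private static final String DECL = "<?xml";
    private static final String COMMENT_OPEN = "<!--";
    private static final String COMMENT_CLOSE = "-->";
    private static final String CDATA_OPEN = "<![CDATA[";
    private static final String CDATA_CLOSE = "]]>";

    static List<String> split(String input) {
        List<String> documents = new ArrayList<>();
        int start = -1; // offset of the document being collected, -1 if none yet
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith(COMMENT_OPEN, i)) {
                // Comments cannot nest: jump to the first "-->".
                i = skipPast(input, COMMENT_CLOSE, i + COMMENT_OPEN.length());
            } else if (input.startsWith(CDATA_OPEN, i)) {
                // CDATA sections cannot nest either: jump to the first "]]>".
                i = skipPast(input, CDATA_CLOSE, i + CDATA_OPEN.length());
            } else if (isDeclarationAt(input, i)) {
                if (start >= 0) {
                    documents.add(input.substring(start, i).trim());
                }
                start = i;
                i += DECL.length();
            } else {
                i++;
            }
        }
        if (start >= 0) {
            documents.add(input.substring(start).trim());
        }
        return documents; // text before the first declaration is dropped
    }

    // True if "<?xml" followed by whitespace starts at offset i.
    private static boolean isDeclarationAt(String s, int i) {
        int after = i + DECL.length();
        return s.startsWith(DECL, i)
                && after < s.length()
                && Character.isWhitespace(s.charAt(after));
    }

    // Offset just past the terminator, or end of input if it never appears.
    private static int skipPast(String s, String terminator, int from) {
        int end = s.indexOf(terminator, from);
        return end < 0 ? s.length() : end + terminator.length();
    }
}
```

Each returned string can then be handed to an ordinary XML parser on its own, for example through a StringReader, which avoids rewriting declarations or inventing a wrapper root element.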
