Reputation: 648
I have a file that has XML-like tags and a bunch of invalid XML data, because of which I cannot use a normal XML validator like xmllint on the file. I want to ignore the invalid XML data and just check the file for well-formedness.
<?xml version="1.0" encoding="utf-8"?>
<HOST>
<VERSION>5</VERSION>
<OUTPUT>
bunch of text which also contains tags like <SYSTEM>
more tags like <-> <temp> & ;
some more text and numbers
</OUTPUT>
</HOST>
In the above example, can I just ignore things like <SYSTEM>, <->, <temp>, &, ; etc. and only check for valid opening and closing tags like <HOST> </HOST>, <VERSION> </VERSION> and <OUTPUT> </OUTPUT>? The above file should come back as well formed, since all the valid tags are properly opened and closed.
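For illustration, the check described above can be sketched as a stack-based scan that only looks at a whitelist of tag names and treats everything else as plain text. This is a minimal sketch in Python rather than Perl (the same idea ports directly to a Perl regex loop); the function name `balanced` and the default tag list are assumptions made for the example, and it will be fooled if the free text itself contains a literal whitelisted tag such as <HOST>.

```python
import re

def balanced(text, tags=("HOST", "VERSION", "OUTPUT")):
    """Return True if the whitelisted tags open and close in properly nested order."""
    # Match only <NAME> or </NAME> for whitelisted names (case-sensitive);
    # anything else, such as <SYSTEM>, <-> or a bare &, is ignored as text.
    pattern = re.compile(r"<(/?)(%s)>" % "|".join(map(re.escape, tags)))
    stack = []
    for closing, name in pattern.findall(text):
        if not closing:
            stack.append(name)            # opening tag: remember it
        elif not stack or stack.pop() != name:
            return False                  # closing tag with no matching opener
    return not stack                      # leftover openers mean unclosed tags

sample = """<?xml version="1.0" encoding="utf-8"?>
<HOST>
<VERSION>5</VERSION>
<OUTPUT>
bunch of text which also contains tags like <SYSTEM>
more tags like <-> <temp> & ;
some more text and numbers
</OUTPUT>
</HOST>
"""

print(balanced(sample))                   # True
print(balanced("<HOST><OUTPUT></HOST>"))  # False
```

Note that this only checks nesting of the chosen tags; it is not XML well-formedness in the strict sense.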
Can I create my own DTD/XSD to look for the tags I want and ignore the rest of the tags, using Perl?
My main problem is that I don't know the right keywords to describe my problem, which is why Google is not giving me the right results. Can someone please push me in the right direction? Thanks
Upvotes: 2
Views: 172
Reputation: 16171
May I ask what's the point? Your input file is not XML, and you don't want to make it XML by adding a CDATA section. What do you gain by knowing whether "some" of the data is XML? It's not like you will be able to use XML tools on it, or that you will be able to deliver it as XML.
So really this non-validation doesn't gain you anything. Isn't it a bit of a waste of time then?
Upvotes: 2
Reputation: 4005
You'll have to clean up the input first. Once you do that, then you can do DTD, schemas, proper parsing, and whatever.
If it's just the OUTPUT tag, you can try this:
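# open a CDATA section right after the first <OUTPUT>
s/(<OUTPUT>)/$1<![CDATA[/;
# close it right before the first </OUTPUT> (the slash in the pattern must be escaped)
s/(<\/OUTPUT>)/]]>$1/;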
After that is done, your input should be ready for XML parsing, validation, etc. If your input might contain CDATA sections, you'll have to do more, but that should be enough to get started.
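As a rough end-to-end illustration of the same idea, here is a minimal sketch in Python (not Perl) that wraps the body of OUTPUT in CDATA and then hands the result to a standard parser, `xml.etree.ElementTree`. The helper names `parse_with_cdata` and `is_well_formed` are made up for this example, and it assumes a single OUTPUT element whose text contains neither `]]>` nor a literal </OUTPUT>.

```python
import xml.etree.ElementTree as ET

def parse_with_cdata(text):
    """Wrap the body of the first OUTPUT element in CDATA, then parse."""
    # Same effect as the two substitutions above: first occurrence only.
    text = text.replace("<OUTPUT>", "<OUTPUT><![CDATA[", 1)
    text = text.replace("</OUTPUT>", "]]></OUTPUT>", 1)
    # Pass bytes so the encoding="utf-8" declaration matches the input.
    return ET.fromstring(text.encode("utf-8"))

def is_well_formed(text):
    """Return True if the cleaned-up text parses as XML."""
    try:
        parse_with_cdata(text)
        return True
    except ET.ParseError:
        return False

sample = """<?xml version="1.0" encoding="utf-8"?>
<HOST>
<VERSION>5</VERSION>
<OUTPUT>
bunch of text which also contains tags like <SYSTEM>
more tags like <-> <temp> & ;
some more text and numbers
</OUTPUT>
</HOST>
"""

root = parse_with_cdata(sample)
print(root.tag)                   # HOST
print(root.find("VERSION").text)  # 5
print(is_well_formed(sample))     # True
```

Once the text parses, the usual tools (DTD or schema validation, XPath and so on) can be applied to the cleaned-up document.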
Upvotes: 1