user17753

Reputation: 3161

xml xsd validation and malformed xml

I wrote a quick class to validate an XML file at a FilePath against an XSD with .NET (see below).

I have large volumes of data files generated by another machine on the LAN. The files are not true XML: they are malformed, but in the same way every time, so based on their structure I can correct them with a few global replacements on the file contents before validating against the XSD. For example, I have to replace <\ with </, and so on. All the replacements are listed in the code.

When I pointed this at a list of about 50k files on the LAN network share of the machine generating them, it took about 15 minutes to complete. I'm wondering whether this is just I/O capped by the LAN, or whether there's a better (quicker) way to correct the malformed XML than the replaces I do here.

using System.IO;
using System.Xml;
using System.Xml.Schema;

class VCheck
{
    private static XmlReaderSettings settings = new XmlReaderSettings();
    private bool valid;
    string message;
    public string Message { get { return message; } }

    public VCheck()
    {
        settings.ValidationType = ValidationType.Schema;
        settings.ValidationFlags |= XmlSchemaValidationFlags.ReportValidationWarnings;
        settings.ValidationEventHandler += new ValidationEventHandler(ValidationCallBack);
        settings.Schemas.Add(null, "schema.xsd");
    }

    public bool CheckFile(string FileFullPath) 
    {
        valid = true;
        message = null;
        try
        {
            // Open the file inside the try block so a bad path is caught below,
            // and dispose both readers so file handles are released promptly.
            using (StreamReader file = new StreamReader(FileFullPath))
            using (XmlReader xml = XmlReader.Create(new StringReader(@"<?xml version='1.0'?><root xmlns=""MYE"">" + 
                file.ReadToEnd().Replace(@"<\", @"</").Replace("&", "&amp;").Replace("\"", "&quot;").Replace("'", "&apos;") + "</root>"), 
                settings))
            {
                while (xml.Read()) ; //read in all xml, validating against xsd
            }
        }
        catch
        {
            //problem reading the xml file in, bad path, disk error etc.
            return false;
        }

        return valid;
    }

    void ValidationCallBack(object sender, ValidationEventArgs e) //called on failed validations
    {
        valid = false;
        message = e.Message;
        switch (e.Severity)
        {
            case XmlSeverityType.Error:
                //Do stuff on validation error
                break;
            case XmlSeverityType.Warning:
                //Do stuff on validation warning
                break;
        }

    }

}

I'd call it from main like this:

    static void Main(string[] args)
    {
        VCheck checker = new VCheck();
        foreach (string file in files) //files is a List<string> of file paths/names
        {
            if (!checker.CheckFile(file))
            {
                //To do stuff if not valid
            }
        }
    }

Upvotes: 0

Answers (2)

C. M. Sperberg-McQueen

Reputation: 25034

The quickest processes are those which do not need to be performed at all. So I commend Michael Kay's comments on dealing with "non-well-formed XML" to your attention.

If the non-XML data you'd like to handle as XML is being generated by a machine, there's no reason that that machine could not be generating XML data instead of the non-XML data you're currently trying to fix. Worse, every minute of effort you put into dealing with the errors in the data-producing process is a minute you've put into persuading those responsible for that process that they are producing correct, well-formed XML. So it's not only yourself you're hurting here.

Upvotes: 0

Konrad Morawski

Reputation: 8394

Given your performance concerns, I don't think reading the whole file into memory with ReadToEnd and then chaining String.Replace calls on the contents is a good choice; each Replace allocates a new copy of the entire string.

If I were you, I'd rather rewrite those files "piece by piece" - that is, buffering and replacing data on the fly.

Just create a new file, load some of the malformed file into the buffer (say 4 KB), do the replacements, flush the results into your newly created file; rinse and repeat.

Beware: one buffer can end with < and the next one start with \. If you don't want to miss any <\ sequences (and the like), you need to handle such cases as well.
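
To make the buffer-boundary point concrete, here is a minimal sketch of the chunked approach. It is not tested against your files; the names ChunkedFixer and Fix are made up for illustration, and it assumes the only multi-character pattern is <\ (the other replacements from the question are single characters). A trailing < is held back until the next character is known, so the pair is still caught when it straddles two buffers.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the question's code.
static class ChunkedFixer
{
    // Copies 'input' to 'output' chunk by chunk, applying the same
    // replacements as the question: <\ -> </, & -> &amp;, " -> &quot;, ' -> &apos;
    public static void Fix(TextReader input, TextWriter output, int bufferSize = 4096)
    {
        char[] buffer = new char[bufferSize];
        bool pendingLt = false; // true when a '<' has been seen but not yet written
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            var sb = new StringBuilder(read + 16);
            for (int i = 0; i < read; i++)
            {
                char c = buffer[i];
                if (pendingLt)
                {
                    pendingLt = false;
                    if (c == '\\') { sb.Append("</"); continue; } // <\ becomes </
                    sb.Append('<'); // ordinary '<', write it unchanged
                }
                switch (c)
                {
                    case '<': pendingLt = true; break; // decide once the next char is known
                    case '&': sb.Append("&amp;"); break;
                    case '"': sb.Append("&quot;"); break;
                    case '\'': sb.Append("&apos;"); break;
                    default: sb.Append(c); break;
                }
            }
            output.Write(sb.ToString());
        }
        if (pendingLt) output.Write('<'); // input ended right after a '<'
    }
}
```

You would call it with a StreamReader on the malformed file and a StreamWriter on the new file, writing the XML declaration and the opening and closing root element yourself before and after the call, as the question's code does with string concatenation.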

Another possible solution is to create your own implementation of a "more tolerant" XmlReader (the class is not sealed, so you can derive from it), although personally I haven't done it and I'm not sure it would be a good approach. Rewriting the files will at least leave you with syntactically valid XML, which may come in useful at some point.


PS. On a side note:

    catch
    {
        //problem reading the xml file in, bad path, disk error etc.
        return false;
    }

I wouldn't do that. It leaves the caller with no idea whatsoever as to why the operation failed.
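
One way to keep the reason around, as a rough sketch: catch the specific exception types you expect and hand the message back to the caller. The helper below (CheckFailure and TryCheck are hypothetical names) only shows the pattern; in your VCheck class the equivalent would be assigning to the existing message field inside each catch.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper illustrating the pattern; not part of the question's code.
static class CheckFailure
{
    // Runs 'check' and, if it throws one of the expected exceptions,
    // reports why through 'reason' instead of silently returning false.
    public static bool TryCheck(Action check, out string reason)
    {
        reason = null;
        try
        {
            check();
            return true;
        }
        catch (XmlException ex) // content is still not well-formed XML
        {
            reason = "Not well-formed: " + ex.Message;
        }
        catch (IOException ex) // missing file, unreachable share, disk error
        {
            reason = "I/O error: " + ex.Message;
        }
        catch (UnauthorizedAccessException ex) // no permission to read the file
        {
            reason = "Access denied: " + ex.Message;
        }
        return false;
    }
}
```

Anything not listed (an OutOfMemoryException, say) still propagates, which is usually what you want for failures you did not anticipate.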

Upvotes: 1
