n h

Reputation: 33

Read single large file containing multiple XML files into multiple xml records in C#

I have a single file that effectively contains multiple XML files of the same format, so the file is not itself valid XML; for instance:

<?xml version='1.0' encoding='UTF-8'?>
<Proposal xmlns="a namespace">
    <ASubnode>Text</ASubnode>
    <LotsOfOtherNodes />
</Proposal>
<?xml version='1.0' encoding='UTF-8'?>
<Proposal xmlns="a namespace">
    <ASubnode>Text</ASubnode>
    <LotsOfOtherNodes />
</Proposal>
....

I would like to process all the Proposal nodes, one at a time; for example:

foreach (var proposal in file)
    do something

I cannot use XmlReader because it throws an exception upon reaching the intermediate XML declarations. I could read the entire file into a string and then use the Split method, but these files are gigabytes in size, so that is not an attractive option. Reading the file a line at a time and searching for the appropriate nodes with a regular expression does not work either: the files are not line-formatted as above with one node per line, but instead contain very long lines of multiple nodes, and random line breaks inside node text.

Is there a method of achieving this without hand-crafting a text parser?

Upvotes: 3

Views: 3283

Answers (2)

George Polevoy

Reputation: 7681

You can read the text line by line without actually parsing the XML, since every document starts with the same XML declaration. This version assumes each declaration sits on a line of its own:

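IEnumerable<XDocument> GetDocuments(Stream bulkStream)
{
    var reader = new StreamReader(bulkStream);
    var sb = new StringBuilder();
    // The first line is the XML declaration; every later document starts with the same line.
    var firstLine = reader.ReadLine();
    string line = firstLine;
    while (line != null)
    {
        sb.Clear();
        sb.AppendLine(firstLine);
        // Collect lines until the next declaration or the end of the file.
        // AppendLine keeps the line breaks, so text split across lines is not run together.
        while ((line = reader.ReadLine()) != null && line != firstLine)
        {
            sb.AppendLine(line);
        }

        yield return XDocument.Parse(sb.ToString());
    }
}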

UPDATE: The following works even if a declaration starts in the middle of a line:

IEnumerable<XDocument> GetDocuments(Stream bulkStream)
{
    const string decl = @"<?xml version='1.0' encoding='UTF-8'?>";
    var sb = new StringBuilder();

    foreach (var line in GetLines(bulkStream).Where(l => !string.IsNullOrWhiteSpace(l)))
    {
        // A declaration marks the start of the next document, so emit
        // whatever has been buffered so far before starting a new buffer.
        if (line == decl && sb.Length > 0)
        {
            yield return XDocument.Parse(sb.ToString());
            sb.Clear();
        }

        sb.AppendLine(line);
    }

    // Emit the last document, which is not followed by another declaration.
    if (sb.Length > 0)
        yield return XDocument.Parse(sb.ToString());
}

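IEnumerable<string> GetLines(Stream bulkStream)
{
    const string decl = @"<?xml version='1.0' encoding='UTF-8'?>";
    var reader = new StreamReader(bulkStream);
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // Split the line at every declaration it contains, so that each
        // declaration is returned as a separate "line" of its own.
        int pos = 0, declIndex;
        while ((declIndex = line.IndexOf(decl, pos, StringComparison.Ordinal)) >= 0)
        {
            yield return line.Substring(pos, declIndex - pos);
            yield return decl;
            pos = declIndex + decl.Length;
        }

        // Whatever follows the last declaration (or the whole line if there was none).
        yield return line.Substring(pos);
    }
}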
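
To make the splitting idea concrete, here is a minimal, self-contained sketch that can be run against an in-memory sample. It is a compact variant rather than the exact code above: it drops the declarations instead of keeping them, and it assumes every declaration is written exactly as in the question. The names BulkXmlDemo and Decl and the namespace urn:example are made up for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml.Linq;

static class BulkXmlDemo
{
    // Assumption: every declaration in the file is spelled exactly like this.
    const string Decl = "<?xml version='1.0' encoding='UTF-8'?>";

    // Cuts the text at every declaration and parses each piece on its own.
    // Only one document is buffered at a time.
    public static IEnumerable<XDocument> GetDocuments(TextReader reader)
    {
        var sb = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            int pos = 0, hit;
            while ((hit = line.IndexOf(Decl, pos, StringComparison.Ordinal)) >= 0)
            {
                // Text before the declaration belongs to the previous document.
                sb.Append(line, pos, hit - pos);
                var text = sb.ToString();
                if (!string.IsNullOrWhiteSpace(text))
                    yield return XDocument.Parse(text);
                sb.Clear();
                pos = hit + Decl.Length; // skip the declaration itself
            }
            sb.Append(line, pos, line.Length - pos).AppendLine();
        }

        // The last document is not followed by a declaration.
        var rest = sb.ToString();
        if (!string.IsNullOrWhiteSpace(rest))
            yield return XDocument.Parse(rest);
    }

    static void Main()
    {
        // Two documents; the second declaration starts in the middle of a line.
        var sample =
            Decl + "\n<Proposal xmlns=\"urn:example\"><ASubnode>One</ASubnode></Proposal>" +
            Decl + "<Proposal xmlns=\"urn:example\"><ASubnode>Two</ASubnode></Proposal>\n";

        XNamespace ns = "urn:example";
        foreach (var doc in GetDocuments(new StringReader(sample)))
            Console.WriteLine(doc.Root.Element(ns + "ASubnode").Value);
    }
}
```

For a real file, a StreamReader over the file stream would take the place of the StringReader. Because the text is still read a line at a time, a file with extremely long lines will buffer each of those lines in full.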

Upvotes: 0

Vlad Bezden

Reputation: 89557

You have two options:

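  1. Tell the XmlReader to not be so picky. Set XmlReaderSettings.ConformanceLevel to ConformanceLevel.Fragment. This lets the parser accept input that has more than one top-level element. Note that this only addresses the missing root node: an XML declaration is still only allowed at the very start of the input, so the repeated declarations have to be stripped or skipped before the reader sees them.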

    var settings = new XmlReaderSettings();
    settings.ConformanceLevel = ConformanceLevel.Fragment;
    using (var reader = XmlReader.Create(textReader, settings))
    {
         ...
    }
    
  2. Wrap the content of your file in a 'root' element, so that the document has only one root node. The intermediate XML declarations have to be removed as well, since a declaration may only appear at the very start of a document:

 <?xml version='1.0' encoding='UTF-8'?>
 <root>
     <Proposal xmlns="a namespace">
         <ASubnode>Text</ASubnode>
         <LotsOfOtherNodes />
     </Proposal>
     <Proposal xmlns="a namespace">
         <ASubnode>Text</ASubnode>
         <LotsOfOtherNodes />
     </Proposal>
 ....
 </root>
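
As an illustration of the first option, the following sketch streams the top-level Proposal elements one at a time from a fragment-mode reader. It assumes the intermediate XML declarations have already been removed from the input. The names FragmentDemo and ReadProposals and the namespace urn:example are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

static class FragmentDemo
{
    // Yields each top-level Proposal element without loading the whole input.
    // Assumption: the input contains no XML declarations after the start.
    public static IEnumerable<XElement> ReadProposals(TextReader input)
    {
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        using (var reader = XmlReader.Create(input, settings))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "Proposal")
                {
                    // ReadFrom consumes the element and leaves the reader on the
                    // node after it, so no extra Read() call is needed here.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    static void Main()
    {
        var sample =
            "<Proposal xmlns=\"urn:example\"><ASubnode>One</ASubnode></Proposal>\n" +
            "<Proposal xmlns=\"urn:example\"><ASubnode>Two</ASubnode></Proposal>";

        XNamespace ns = "urn:example";
        foreach (var proposal in ReadProposals(new StringReader(sample)))
            Console.WriteLine(proposal.Element(ns + "ASubnode").Value);
    }
}
```

Only one Proposal element is materialized at a time, which is what matters for files that are gigabytes in size.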

Upvotes: 2
