Reputation: 33
I have a single file that effectively contains multiple XML files of the same format, so the file is not itself valid XML; for instance:
<?xml version='1.0' encoding='UTF-8'?>
<Proposal xmlns="a namespace">
<ASubNode>Text</ASubNode>
<LotsOfOtherNodes />
</Proposal>
<?xml version='1.0' encoding='UTF-8'?>
<Proposal xmlns="a namespace">
<ASubNode>Text</ASubNode>
<LotsOfOtherNodes />
</Proposal>
....
I would like to process all the Proposal nodes, one at a time; for example:
foreach (var proposal in file)
do something
I cannot use XmlReader as is, because it throws an exception when it reaches the intermediate XML declarations. I could read the entire file into a string and then use the Split method, but these files are gigabytes in size, so that is not an attractive option. I could also read the file a line at a time and search for the appropriate nodes with a regular expression, but the files are not formatted as above with one node per line; they contain very long lines holding multiple nodes, and arbitrary line breaks inside node text.
Is there a method of achieving this without hand-crafting a text parser?
Upvotes: 3
Views: 3283
Reputation: 7681
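You can read the text line by line without actually parsing the XML, since every document starts with the same XML declaration (this assumes each declaration sits on a line of its own):
// Requires: System.Collections.Generic, System.IO, System.Text, System.Xml.Linq
IEnumerable<XDocument> GetDocuments(Stream bulkStream)
{
    var reader = new StreamReader(bulkStream);
    var sb = new StringBuilder();
    // The first line is the XML declaration; it repeats before every document.
    var firstLine = reader.ReadLine();
    string line = firstLine;
    while (line != null)
    {
        sb.Clear();
        sb.AppendLine(firstLine);
        // Collect lines until the next declaration or the end of the file.
        while ((line = reader.ReadLine()) != null && line != firstLine)
        {
            // AppendLine keeps the line break, so text split across lines is not glued together.
            sb.AppendLine(line);
        }
        yield return XDocument.Parse(sb.ToString());
    }
}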
UPDATE: The following also works when a declaration starts in the middle of a line:
IEnumerable<XDocument> GetDocuments(Stream bulkStream)
{
    const string decl = @"<?xml version='1.0' encoding='UTF-8'?>";
    var sb = new StringBuilder();
    foreach (var line in GetLines(bulkStream).Where(l => !string.IsNullOrWhiteSpace(l)))
    {
        // A declaration marks the start of the next document,
        // so emit whatever has been buffered so far.
        if (line == decl && sb.Length > 0)
        {
            yield return XDocument.Parse(sb.ToString());
            sb.Clear();
        }
        sb.AppendLine(line);
    }
    // Emit the last document, which has no declaration after it.
    if (sb.Length > 0)
        yield return XDocument.Parse(sb.ToString());
}
IEnumerable<string> GetLines(Stream bulkStream)
{
    const string decl = @"<?xml version='1.0' encoding='UTF-8'?>";
    var reader = new StreamReader(bulkStream);
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // Split the line around every declaration it contains, not just the first one.
        int declIndex;
        while ((declIndex = line.IndexOf(decl, StringComparison.Ordinal)) >= 0)
        {
            yield return line.Substring(0, declIndex);
            yield return decl;
            line = line.Substring(declIndex + decl.Length);
        }
        // Whatever is left (possibly the whole line) contains no declaration.
        yield return line;
    }
}
Upvotes: 0
Reputation: 89557
You have two options:
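Tell the XmlReader not to be so picky: set XmlReaderSettings.ConformanceLevel to ConformanceLevel.Fragment. This lets the parser accept input that has no single root node. As far as I know it does not make the reader accept an XML declaration in the middle of the input, so the intermediate declarations still have to be filtered out before the text reaches the reader.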
var settings = new XmlReaderSettings();
settings.ConformanceLevel = ConformanceLevel.Fragment;
using (var reader = XmlReader.Create(textReader, settings))
{
...
}
Wrap the contents of the file in your own 'root' element, so that the document has only one root node. The intermediate XML declarations have to be removed as well, because a declaration is only allowed at the very start of a document:
<?xml version='1.0' encoding='UTF-8'?>
<root>
<Proposal xmlns="a namespace">
<ASubNode>Text</ASubNode>
<LotsOfOtherNodes />
</Proposal>
<Proposal xmlns="a namespace">
<ASubNode>Text</ASubNode>
<LotsOfOtherNodes />
</Proposal>
....
</root>
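Here is a rough sketch of how the first option could be made to work on the file from the question without loading it into memory. It puts a small filtering TextReader in front of the XmlReader that drops every XML declaration, then reads the remaining root elements one at a time with XNode.ReadFrom. DeclarationStrippingReader and ProposalFile.ReadRootElements are names made up for this example, not framework types. The sketch assumes that the text "<?xml " never occurs inside a comment or CDATA section, that each single Proposal fits in memory, and that the underlying reader is opened with the right encoding (the declarations that name it are thrown away).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper: passes characters through from the wrapped reader,
// but drops every "<?xml ... ?>" declaration it encounters.
sealed class DeclarationStrippingReader : TextReader
{
    private const string Marker = "<?xml";
    private readonly TextReader _inner;
    // Characters read ahead that turned out not to be part of a declaration.
    private readonly Queue<char> _pending = new Queue<char>();

    public DeclarationStrippingReader(TextReader inner) { _inner = inner; }

    // Only Read() is overridden; the inherited Read(char[], int, int) calls it
    // once per character. Peek() is left at the base behaviour (returns -1).
    public override int Read()
    {
        while (true)
        {
            if (_pending.Count > 0) return _pending.Dequeue();
            int c = _inner.Read();
            if (c != '<') return c;              // also covers -1 at end of input
            _pending.Enqueue('<');
            if (!LooksLikeDeclaration()) continue; // buffered chars are returned above
            _pending.Clear();
            SkipToEndOfDeclaration();
        }
    }

    // Checks whether the characters after '<' are "?xml" followed by whitespace.
    // Everything read is buffered so it can be replayed if this is not a declaration.
    private bool LooksLikeDeclaration()
    {
        for (int i = 1; i <= Marker.Length; i++)
        {
            int next = _inner.Read();
            if (next == -1) return false;
            _pending.Enqueue((char)next);
            bool ok = i < Marker.Length ? next == Marker[i] : char.IsWhiteSpace((char)next);
            if (!ok) return false;
        }
        return true;
    }

    // Consumes input up to and including the closing "?>".
    private void SkipToEndOfDeclaration()
    {
        int prev = -1, cur;
        while ((cur = _inner.Read()) != -1)
        {
            if (prev == '?' && cur == '>') return;
            prev = cur;
        }
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}

static class ProposalFile
{
    // Lazily yields each root-level element; only one element is held in memory at a time.
    public static IEnumerable<XElement> ReadRootElements(TextReader text)
    {
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        using (var reader = XmlReader.Create(new DeclarationStrippingReader(text), settings))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element)
                    yield return (XElement)XNode.ReadFrom(reader); // advances past the element
                else
                    reader.Read(); // skip whitespace between documents
            }
        }
    }
}
```

With this in place, the loop from the question becomes something like foreach (var proposal in ProposalFile.ReadRootElements(new StreamReader(path))), where path is whatever file you are processing. The filter works one character at a time, which keeps the sketch short; for multi-gigabyte input a buffered Read(char[], int, int) override would probably be worth the extra code.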
Upvotes: 2