Reputation: 29441
I work with big XML files (~2 GB). Up to now, the reading was done this way:
private void readParameters(XmlReader m, Measurement meas)
{
    while (m.ReadToFollowing("PAR"))
    {
        XmlReader par = m.ReadSubtree();
        readParameter(par, meas);
        par.Close();
        ((IDisposable)par).Dispose();
    }
}
Which went well, but was slooooow. So I brought my science in and tried to parallelize the reading:
private void readParameters(XmlReader m, Measurement meas)
{
    List<XmlReader> readers = new List<XmlReader>();
    while (m.ReadToFollowing("PAR"))
    {
        readers.Add(m.ReadSubtree());
    }
    Parallel.ForEach(readers, reader =>
    {
        readParameter(reader, meas);
        reader.Close();
        ((IDisposable)reader).Dispose();
    });
}
But it reads the same node in every iteration of the foreach. How can I fix this? Is this even a good way to parallelize the reading?
Upvotes: 0
Views: 549
Reputation: 111870
Because, as written in the remarks of ReadSubtree:
ReadSubtree can be called only on element nodes. When the entire sub-tree has been read, calls to the Read method returns false. When the new XmlReader has been closed, the original XmlReader will be positioned on the EndElement node of the sub-tree. Thus, if you called the ReadSubtree method on the start tag of the book element, after the sub-tree has been read and the new XmlReader has been closed, the original XmlReader is positioned on the end tag of the book element. You should not perform any operations on the original XmlReader until the new XmlReader has been closed. This action is not supported and can result in unpredictable behavior.
Clearly this method isn't thread-safe. You can't "put aside" some ReadSubtree() readers and then use them later as you are trying to do.
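To see the problem in isolation, here is a minimal self-contained sketch (with a hypothetical in-memory document standing in for your file); because the stashed readers all wrap the same underlying reader, what they report once that reader has moved on is unsupported and unpredictable:
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

class SubtreeDemo
{
    static void Main()
    {
        // Hypothetical in-memory document standing in for the real 2 GB file.
        const string xml = "<ROOT><PAR id=\"1\"/><PAR id=\"2\"/><PAR id=\"3\"/></ROOT>";
        using (XmlReader m = XmlReader.Create(new StringReader(xml)))
        {
            // Stash the subtree readers without consuming them, as in the question.
            List<XmlReader> stashed = new List<XmlReader>();
            while (m.ReadToFollowing("PAR"))
            {
                stashed.Add(m.ReadSubtree());
            }

            // Every stashed reader wraps the SAME underlying reader, which has
            // already been advanced past all the PAR elements, so what gets
            // reported here is unsupported and unpredictable.
            foreach (XmlReader par in stashed)
            {
                par.Read();
                Console.WriteLine("{0} {1}", par.NodeType, par.Name);
            }
        }
    }
}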
In general, considering that XmlReader "represents a reader that provides fast, noncached, forward-only access to XML data", you clearly can't do what you want. The Stream the XmlReader is using could itself be forward-only, so cloning the reader would require either that the Stream be "forked" (one "copy" for each clone of the XmlReader), something the Stream is not guaranteed to support, or that the XmlReader cache the nodes, something the XmlReader is guaranteed not to do.
As suggested by @mike z, you could
List<XElement> elements = new List<XElement>();
while (m.ReadToFollowing("PAR"))
{
    elements.Add(XElement.Load(m.ReadSubtree()));
}
Parallel.ForEach(elements, el =>
{
});
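The body of the lambda is where your per-parameter work goes; a minimal sketch, assuming made-up element/attribute names and a hypothetical AddParameter method on your Measurement class:
Parallel.ForEach(elements, el =>
{
    // "name" and "VALUE" are made-up names; adapt them to your schema.
    string parName = (string)el.Attribute("name");
    double parValue = (double)el.Element("VALUE");

    // Hypothetical method; protect shared state if Measurement isn't thread-safe.
    lock (meas)
    {
        meas.AddParameter(parName, parValue);
    }
});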
But I'm not sure this would change anything other than your memory use (watch more than 2 GB of memory go away :-) ), because now the whole XML parsing is done in the "main" thread, and all the PAR elements are loaded into XElement objects.
Or probably you could try:
public sealed class MyClass : IEnumerable<XElement>, IDisposable
{
    public readonly XmlReader Reader;

    public MyClass(XmlReader reader)
    {
        Reader = reader;
    }

    // The class is sealed, so a plain Dispose() is enough (no Dispose(bool) pattern needed)
    public void Dispose()
    {
        Reader.Dispose();
    }

    public IEnumerator<XElement> GetEnumerator()
    {
        while (Reader.ReadToFollowing("PAR"))
        {
            yield return XElement.Load(Reader.ReadSubtree());
        }
    }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
private static void readParameters(XmlReader m, Measurement meas)
{
    var enu = new MyClass(m);
    Parallel.ForEach(enu, el =>
    {
        // You do the work on each XElement here
    });
}
Now the Parallel.ForEach is lazily fed by an enumerable, MyClass (excuse me for the name :-) ), that will lazily load the subtrees.
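Note that Parallel.ForEach may still buffer a chunk of elements per worker thread by default; if you want to keep only a few XElement instances alive at any time, you can wrap the enumerable in a non-buffering partitioner (a sketch, assuming .NET 4.5 or later):
// requires: using System.Collections.Concurrent;
var partitioner = Partitioner.Create(enu, EnumerablePartitionerOptions.NoBuffering);
Parallel.ForEach(partitioner, el =>
{
    // You do the work here; elements are handed out one at a time
});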
Upvotes: 3