Reputation: 29441
I work with big XML files (~2 GB). Up to now, the reading was done this way:
private void readParameters(XmlReader m, Measurement meas)
{
    while (m.ReadToFollowing("PAR"))
    {
        XmlReader par = m.ReadSubtree();
        readParameter(par, meas);
        par.Close();
        ((IDisposable)par).Dispose();
    }
}
Which went well, but was slooooow. So I brought my science in and tried to parallelize the reading:
private void readParameters(XmlReader m, Measurement meas)
{
    List<XmlReader> readers = new List<XmlReader>();
    while (m.ReadToFollowing("PAR"))
    {
        readers.Add(m.ReadSubtree());
    }
    Parallel.ForEach(readers, reader =>
    {
        readParameter(reader, meas);
        reader.Close();
        ((IDisposable)reader).Dispose();
    });
}
But it reads the same node in every iteration of the foreach. How can I fix this? Is this even a good way to parallelize the reading?
Upvotes: 0
Views: 549
Reputation: 111870
Because, as written in the remarks of ReadSubtree:
ReadSubtree can be called only on element nodes. When the entire sub-tree has been read, calls to the Read method returns false. When the new XmlReader has been closed, the original XmlReader will be positioned on the EndElement node of the sub-tree. Thus, if you called the ReadSubtree method on the start tag of the book element, after the sub-tree has been read and the new XmlReader has been closed, the original XmlReader is positioned on the end tag of the book element. You should not perform any operations on the original XmlReader until the new XmlReader has been closed. This action is not supported and can result in unpredictable behavior.
Clearly this method isn't thread-safe. You can't "put aside" some ReadSubtree() readers and then use them later as you are trying to do.
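To see the problem in isolation, here is a minimal self-contained sketch (with a hypothetical in-memory document standing in for your file); because the stashed readers all wrap the same underlying reader, what they report once that reader has moved on is unsupported and unpredictable:
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

class SubtreeDemo
{
    static void Main()
    {
        // Hypothetical in-memory document standing in for the real 2 GB file.
        const string xml = "<ROOT><PAR id=\"1\"/><PAR id=\"2\"/><PAR id=\"3\"/></ROOT>";
        using (XmlReader m = XmlReader.Create(new StringReader(xml)))
        {
            // Stash the subtree readers without consuming them, as in the question.
            List<XmlReader> stashed = new List<XmlReader>();
            while (m.ReadToFollowing("PAR"))
            {
                stashed.Add(m.ReadSubtree());
            }

            // Every stashed reader wraps the SAME underlying reader, which has
            // already been advanced past all the PAR elements, so what gets
            // reported here is unsupported and unpredictable.
            foreach (XmlReader par in stashed)
            {
                par.Read();
                Console.WriteLine("{0} {1}", par.NodeType, par.Name);
            }
        }
    }
}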
In general, considering that XmlReader "represents a reader that provides fast, noncached, forward-only access to XML data", you clearly can't do what you want. The Stream the XmlReader is using could itself be forward-only, so cloning the reader would require either that the Stream be "forked" (one "copy" for each clone of the XmlReader), something the Stream is not guaranteed to support, or that the XmlReader cache the nodes, something the XmlReader is guaranteed not to do.
As suggested by @mike z, you could
List<XElement> elements = new List<XElement>();
while (m.ReadToFollowing("PAR"))
{
    elements.Add(XElement.Load(m.ReadSubtree()));
}
Parallel.ForEach(elements, el =>
{
});
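The body of the lambda is where your per-parameter work goes; a minimal sketch, assuming made-up element/attribute names and a hypothetical AddParameter method on your Measurement class:
Parallel.ForEach(elements, el =>
{
    // "name" and "VALUE" are made-up names; adapt them to your schema.
    string parName = (string)el.Attribute("name");
    double parValue = (double)el.Element("VALUE");

    // Hypothetical method; protect shared state if Measurement isn't thread-safe.
    lock (meas)
    {
        meas.AddParameter(parName, parValue);
    }
});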
But I'm not sure this would change anything other than your memory use (watch more than 2 GB of memory go away :-) ), because now the whole XML parsing is done in the "main" thread, and all the PAR elements are loaded into XElement objects.
Or probably you could try:
public sealed class MyClass : IEnumerable<XElement>, IDisposable
{
    public readonly XmlReader Reader;

    public MyClass(XmlReader reader)
    {
        Reader = reader;
    }

    // The class is sealed, so a plain Dispose() is enough (no Dispose(bool) pattern needed)
    public void Dispose()
    {
        Reader.Dispose();
    }

    public IEnumerator<XElement> GetEnumerator()
    {
        while (Reader.ReadToFollowing("PAR"))
        {
            yield return XElement.Load(Reader.ReadSubtree());
        }
    }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
private static void readParameters(XmlReader m, Measurement meas)
{
    var enu = new MyClass(m);
    Parallel.ForEach(enu, el =>
    {
        // You do the work on each XElement here
    });
}
Now the Parallel.ForEach is lazily fed by an enumerable, MyClass (excuse me for the name :-) ), that will lazily load the subtrees.
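Note that Parallel.ForEach may still buffer a chunk of elements per worker thread by default; if you want to keep only a few XElement instances alive at any time, you can wrap the enumerable in a non-buffering partitioner (a sketch, assuming .NET 4.5 or later):
// requires: using System.Collections.Concurrent;
var partitioner = Partitioner.Create(enu, EnumerablePartitionerOptions.NoBuffering);
Parallel.ForEach(partitioner, el =>
{
    // You do the work here; elements are handed out one at a time
});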
Upvotes: 3