delliottg

Reputation: 4140

Make file IO with LINQ more efficient with large numbers of small XML files?

I have a batch of roughly 13 thousand XML files (growing by potentially hundreds per day) that I need to process with LINQ, filtering and transforming the data into what I need and aggregating each of the seven possible event types into a single event-type file (see below). So, 13k files into 7 files. The event types are well delineated in the XML, so the filtering & aggregation is relatively easy. These aggregate files will then be used to create MySQL insert statements for our database, using a script I've already written that works well.

I have functional code and it's processing the files, but it's been running for 23+ hours so far and looks like it's probably only about half done(?). I neglected to put in a file counter, so I don't really know, and I'm loath to restart it. I can make educated guesses judging by the size of the original files (360 MB or so) vs. the processed files (180 MB or so). I anticipate having to run this maybe half a dozen times until we dump this method of data collection (using XML files as a database) and transition to using MySQL exclusively, so I'm hoping I can find a more efficient way of processing the files. I don't really want to spend 2+ days per run if I don't have to.

It's running locally on my machine, but only on one HD (a 10k RPM Barracuda, I think). Would it possibly be faster reading from one drive and writing to a separate drive? I'm pretty sure my bottlenecks are caused by file IO; I'm opening and closing the files literally thousands of times. Maybe I can refactor to only open once for reading and do everything in memory? I know that'd be faster, but I risk losing a whole file's worth of data if something goes awry. I still have to open each of the 13k files to read them, process them, and write out to an XElement.

Here's the code I'm running. I'm using LINQPad and running the code as C# statements, but I can turn it into a real executable if necessary. LINQPad is just so convenient for prototyping stuff like this! Please let me know if examples of the XML would make this easier to figure out, but at first blush it doesn't seem germane. The files range in size from 2k to 285k, but only 300 or so are above 100k; most are in the 25-50k range.

string sourceDir = @"C:\splitXML\results\XML\";//source for the 13k files
string xmlDestDir = @"C:\results\XMLSorted\";//destination for the resultant 7 files
List<string> sourceList = new List<string>();
sourceList = Directory.EnumerateFiles(sourceDir, "*.xml", SearchOption.AllDirectories).ToList();
string destFile = null;
string[] events = { "Creation", "Assignment", "Modification", "Repair", "RepairReview", "Termination", "Test" };
foreach(string eventItem in events)
{
try
{
        //this should only happen once the first time through and 
        //shouldn't be a continuing problem
        destFile = Path.Combine(xmlDestDir, eventItem + "Uber.xml");
    if (!File.Exists(destFile))
    {
        XmlTextWriter writer = new XmlTextWriter( destFile, null );
        writer.WriteStartElement( "PCBDatabase" );
        writer.WriteEndElement();
        writer.Close();
    }
}
catch(Exception ex)
{
    Console.WriteLine(ex);
}
}

foreach(var file in sourceList) //roughly 13k files
{
    XDocument xd = XDocument.Load(file);    
    var actionEvents =
        from e in xd.Descendants("PCBDatabase").Elements()
    select e;
foreach(XElement actionEvent in actionEvents)
{
    //this is where I think it's bogging down, it's constant file IO
        var eventName =
    from e in actionEvents.Elements()
    select e.Name;
    var eventType = eventName.First();
    destFile = Path.Combine(xmlDestDir, eventType + "Uber.xml");
        //another bottle neck opening each file thousands of times
    XElement xeDoc = XElement.Load(destFile);
    xeDoc.Add(actionEvent);
        //and last bottle neck, closing each file thousands of times
        xeDoc.Save(destFile);
    }
}

Upvotes: 4

Views: 1265

Answers (3)

Mike Zboray

Reputation: 40818

You are spending a huge amount of time reopening your XML files and parsing them into XDocument objects. Since these Uber files are going to be quite large, what you want to do is open them once and write to them in a forward-only fashion. The code below is a sample of how you would go about that. I also moved getting the eventType out of the inner loop (since it did not depend on the inner loop variable).

Note that this sample will recreate the Uber files from scratch each time. If that is not what you want to do, then instead of reading them into XDocument I would suggest using the code below to create "temp" files and then using two XmlReader instances to read the files and merge the contents with an XmlWriter (a sketch of that merge step follows the sample).

using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static void Main(string[] args)
{
    string sourceDir = @"C:\splitXML\results\XML\";
    string xmlDestDir = @"C:\results\XMLSorted\";
    string[] events = { "Creation", "Assignment", "Modification", "Repair", "RepairReview", "Termination", "Test" };
    Dictionary<string, XmlWriter> writers = events.ToDictionary(e => e, e => XmlWriter.Create(Path.Combine(xmlDestDir, e + "Uber.xml")));

    foreach(var writer in writers.Values)
    {
        writer.WriteStartDocument();
        writer.WriteStartElement("PCBDatabase");
    }

    foreach(var file in Directory.EnumerateFiles(sourceDir, "*.xml", SearchOption.AllDirectories)) //roughly 13k files
    {
        XDocument xd = XDocument.Load(file);    
        var actionEvents = from e in xd.Descendants("PCBDatabase").Elements() select e;
        string eventType = (from e in actionEvents.Elements() select e.Name.ToString()).First();

        foreach(XElement actionEvent in actionEvents)
        {
            actionEvent.WriteTo(writers[eventType]);
        }    
    }

    foreach(var writer in writers.Values)
    {
        writer.WriteEndElement();
        writer.WriteEndDocument();
        writer.Close();
    }            
}
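
For the merge variant mentioned above, here is a rough sketch rather than a drop-in implementation. The method name and file paths are illustrative, and it assumes every file shares the same PCBDatabase root element; XmlWriter.WriteNode copies the element the reader is positioned on and advances the reader, so no document is ever fully materialized in memory.

private static void MergeUberFiles(string existingPath, string tempPath, string mergedPath)
{
    using (XmlWriter writer = XmlWriter.Create(mergedPath))
    {
        writer.WriteStartDocument();
        writer.WriteStartElement("PCBDatabase");

        foreach (string inputPath in new[] { existingPath, tempPath })
        {
            using (XmlReader reader = XmlReader.Create(inputPath))
            {
                // Move to the root element and step inside it.
                reader.MoveToContent();
                reader.ReadStartElement("PCBDatabase");

                // Copy each child element verbatim; WriteNode advances the reader past it.
                while (reader.NodeType != XmlNodeType.EndElement && reader.NodeType != XmlNodeType.None)
                {
                    if (reader.NodeType == XmlNodeType.Element)
                        writer.WriteNode(reader, false);
                    else
                        reader.Read(); // skip whitespace, comments, etc.
                }
            }
        }

        writer.WriteEndElement();
        writer.WriteEndDocument();
    }
}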

Upvotes: 2

Servy

Reputation: 203820

Writing to the result file (and more importantly, loading it every time you want to add an element) is indeed what's killing you. Storing all of the data you want to write in memory is also problematic, if for no other reason than you may not have enough memory to do that. You need a middle ground, and that means batching. Read in a few hundred elements, store them in a structure in memory, and then once it gets sufficiently large (play around with the batch size to see what works best) write them all out to the output file(s).

We'll therefore start with this Batch extension method, which splits an IEnumerable into batches:

public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int batchSize)
{
    List<T> buffer = new List<T>(batchSize);

    foreach (T item in source)
    {
        buffer.Add(item);

        if (buffer.Count >= batchSize)
        {
            yield return buffer;
            buffer = new List<T>(batchSize);
        }
    }
    if (buffer.Count > 0)
    {
        yield return buffer;
    }
}

Next, the query you're using can actually be refactored to use LINQ more effectively. You have several selects that aren't really doing anything, and you can use SelectMany instead of explicit foreach loops to pull it all into one query.

var batchesToWrite = sourceList.SelectMany(file =>
        XDocument.Load(file).Descendants("PCBDatabase").Elements())
    .Select((element, index) => new
    {
        element,
        index,
        file = Path.Combine(xmlDestDir, element.Elements().First().Name + "Uber.xml"),
    })
    .Batch(batchSize)
    .Select(batch => batch.GroupBy(element => element.file));

Then just write out each of the batches:

foreach (var batch in batchesToWrite)
{
    foreach (var group in batch)
    {
        WriteElementsToFile(group.Select(element => element.element), group.Key);
    }
}

As for actually writing the elements out to a file, I've extracted that into a method because there are likely different ways of writing your output. You can start with the implementation you're already using, just to see how it performs:

private static void WriteElementsToFile(IEnumerable<XElement> elements, string path)
{
    XElement xeDoc = XElement.Load(path);
    foreach (var element in elements)
        xeDoc.Add(element);
    xeDoc.Save(path);
}

But you still have the issue that you're reading in the entire output file just to append elements to the end. The batching alone may have mitigated this enough for your purposes, but if it has not, then you may wish to address this method on its own, possibly using something other than LINQ to XML to write the results so that you don't need to load the entire file into memory just to append to this one document.
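
One possible direction for that method, offered only as a sketch: stream the existing document through an XmlReader into an XmlWriter, tack the new elements on at the end, and then swap the files. The method name, the temp-file scheme, and the assumption that the root element is always PCBDatabase are all illustrative:

private static void AppendElementsToFile(IEnumerable<XElement> elements, string path)
{
    string tempPath = path + ".tmp"; // illustrative temp-file scheme

    using (XmlReader reader = XmlReader.Create(path))
    using (XmlWriter writer = XmlWriter.Create(tempPath))
    {
        writer.WriteStartDocument();
        writer.WriteStartElement("PCBDatabase");

        // Copy the existing children without loading the whole document.
        reader.MoveToContent();
        reader.ReadStartElement("PCBDatabase");
        while (reader.NodeType != XmlNodeType.EndElement && reader.NodeType != XmlNodeType.None)
        {
            if (reader.NodeType == XmlNodeType.Element)
                writer.WriteNode(reader, false); // advances the reader past the element
            else
                reader.Read();
        }

        // Append the new batch.
        foreach (XElement element in elements)
            element.WriteTo(writer);

        writer.WriteEndElement();
        writer.WriteEndDocument();
    }

    File.Delete(path);
    File.Move(tempPath, path);
}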

Upvotes: 2

xanatos

Reputation: 111860

You have fallen into a classic antipattern: Schlemiel the Painter's algorithm.

With each source file you re-read one of the uber XML files, modify it and re-write it in full... so the more files you have already processed, the slower it is to process the next one. Considering the total size of your files, perhaps it would have been better to keep the uber documents in memory and write them out only once, at the end of the process (a sketch of that follows).
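
A minimal sketch of that in-memory approach, reusing the events, sourceList, and xmlDestDir variables from the question and the per-element event-type lookup used in the other answers:

// Build one in-memory root per event type, fill them all, then save each one once.
var roots = events.ToDictionary(e => e, e => new XElement("PCBDatabase"));

foreach (var file in sourceList)
{
    XDocument xd = XDocument.Load(file);
    foreach (XElement actionEvent in xd.Descendants("PCBDatabase").Elements())
    {
        string eventType = actionEvent.Elements().First().Name.ToString();
        roots[eventType].Add(actionEvent);
    }
}

foreach (var pair in roots)
    pair.Value.Save(Path.Combine(xmlDestDir, pair.Key + "Uber.xml"));

The trade-off is the one the question already notes: if the process dies mid-run, nothing has been written out yet.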

Another possible solution is to keep several XmlWriters open, one for each of the uber files, and write to them as you go. They are stream-based, so you can always append new items, and if they are backed by a FileStream the writers will save straight to the files.

Upvotes: 2
