Reputation: 579
Story: I have a big XML file of about 70+ GB that I need to parse into my database once a week. I currently have a working parser in VB.NET using XmlReader. I'm maxing out at about 5.000 nodes/sec, and with more than 10.000.000 nodes and counting, it takes a while to complete. The program runs on my own server in the garage with an ordinary SSD. I assumed the limiting factor was the SSD, so I recently upgraded to a Samsung EVO 970 M.2 SSD with roughly 6x the read/write speed. The problem is that I didn't see any noticeable performance increase. Looking back, it's probably obvious that the bottleneck is somewhere else.
Idea: I started investigating. I implemented 2 separate threads, each reading the file from the beginning, a few seconds apart. Each thread still read and processed about 5.000 nodes/sec, so effectively I was processing 10.000 nodes/sec. The problem, however, was that I was parsing every node twice, which rather defeated the purpose.

The next idea was for one thread to read and process the data from the beginning. The second thread would also read from the beginning of the file, but would simply skip the first half of the file before it started processing the data. Using XmlReader.ReadToNextSibling() I was able to "skip" 6.250.000 nodes at a rate of 10.000 nodes/sec. This means the first thread would have processed about 3.125.000 nodes by the time the second thread finishes the "skip" and starts parsing from 6.250.000 nodes in. At that point there would be about 6.875.000 nodes left, split between the 2 threads, and the parser would from then on process about 10.000 nodes/sec. Essentially I would like to keep adding threads until I hit another bottleneck.

This approach is very primitive and "wastes" a lot of time reading and skipping the same nodes. I tried XmlReaderSettings.LineNumberOffset, but I couldn't get it to work; it always seemed to read from the beginning no matter the offset:
' Attempt to make the reader start further into the file via LineNumberOffset -
' reading still started from the beginning regardless of the offset.
Dim settings = New Xml.XmlReaderSettings()
settings.LineNumberOffset = 100000
Dim XMLReader = Xml.XmlReader.Create("C:\largexml.xml", settings)
Question: Any ideas on parallel reading of large XML files, possible bottlenecks, or optimizations? Is there a faster way to "skip" n elements than using ReadToNextSibling like this?
' Skip the first 5000000 <ns> elements by stepping past them one sibling at a time
Dim Count As Integer = 0
While Count < 5000000
    XMLReader.ReadToNextSibling("ns")
    Count += 1
End While
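For reference, here is a minimal sketch of the two-reader approach described above. It assumes the repeating element is called ns and that the total node count is known up front (TOTAL_NODES below is a made-up figure); the per-node parsing is left as a placeholder comment:

Imports System.Threading.Tasks
Imports System.Xml

Module SkipSketch
    Const FILENAME As String = "C:\largexml.xml"
    Const TOTAL_NODES As Integer = 12500000 ' hypothetical, assumed known in advance

    ' Skip skipCount <ns> elements, then parse up to takeCount of them.
    Sub ParseRange(skipCount As Integer, takeCount As Integer)
        Using reader As XmlReader = XmlReader.Create(FILENAME)
            reader.ReadToFollowing("ns")        ' position on the first <ns> element
            For i As Integer = 1 To skipCount
                reader.ReadToNextSibling("ns")  ' step past an element without parsing it
            Next
            For i As Integer = 1 To takeCount
                ' ... parse the current <ns> element into the database here ...
                If Not reader.ReadToNextSibling("ns") Then Exit For
            Next
        End Using
    End Sub

    Sub Main()
        Dim half As Integer = TOTAL_NODES \ 2
        Dim first = Task.Run(Sub() ParseRange(0, half))                   ' processes the first half
        Dim second = Task.Run(Sub() ParseRange(half, TOTAL_NODES - half)) ' skips, then processes the second half
        Task.WaitAll(first, second)
    End Sub
End Module

Each additional reader only pays the cost of skipping up to its own starting offset, which is the overhead the calculations below try to quantify.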
I've made some calculations on the current "skip" solution, and since each thread has to read the same data, the speed bonus is asymptotic to a +50% boost. Here is a graph of the expected performance increase per thread, in case anyone has the same problem and is considering a similarly naive solution.
Upvotes: 2
Views: 657
Reputation: 34421
See if the following code is faster. XElement.ReadFrom() reads a single element and advances past it, so each item is only read once. Check the memory usage in Task Manager as the code runs: if you run out of memory, swap space on disk is used, which can really slow down an app.
Imports System.Xml
Imports System.Xml.Linq

Module Module1
    Const FILENAME As String = "C:\largexml.xml"

    Sub Main()
        Dim reader As XmlReader = XmlReader.Create(FILENAME)
        While Not reader.EOF
            ' Move to the next <ns> element if we are not already positioned on one
            If reader.Name <> "ns" Then
                reader.ReadToFollowing("ns")
            End If
            If Not reader.EOF Then
                ' Reads the current <ns> element and advances the reader past it
                Dim ns As XElement = XElement.ReadFrom(reader)
            End If
        End While
    End Sub
End Module
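For completeness, a hedged sketch of what pulling values out of each element might look like, assuming (hypothetically) that every ns element has Id and Name children; these lines would go in place of the Dim ns line inside the If block:

Dim ns As XElement = XElement.ReadFrom(reader)
' Hypothetical child elements - adjust to the real schema
Dim id As String = ns.Element("Id")?.Value
Dim name As String = ns.Element("Name")?.Value
' ... write id and name to the database here ...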
Upvotes: 1