Reputation: 17186
I've not done much with LINQ to XML, but all the examples I've seen load the entire XML document into memory.
What if the XML file is, say, 8GB, and you really don't have the option?
My first thought is to use the XElement.Load(TextReader) overload in combination with an instance of the FileStream class (see the sketch below).
QUESTION: will this work, and is this the right way to approach the problem of searching a very large XML file?
Note: high performance isn't required. I'm basically trying to get LINQ to XML to do the work of a program I could write myself that loops through every line of my big file and gathers up the matching data; since LINQ is "loop centric", I'd expect this to be possible.
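Something like this sketch is what I have in mind (untested; huge.xml and the element/attribute names are just placeholders):

    using System;
    using System.IO;
    using System.Linq;
    using System.Xml.Linq;

    class Program
    {
        static void Main()
        {
            // Wrap the big file in a FileStream/StreamReader and hand the
            // reader to XElement.Load, then query with ordinary LINQ to XML.
            using (var stream = new FileStream("huge.xml", FileMode.Open, FileAccess.Read))
            using (var reader = new StreamReader(stream))
            {
                XElement root = XElement.Load(reader);
                var matches = root.Descendants("item")
                                  .Where(e => (string)e.Attribute("id") == "42");
                foreach (var match in matches)
                    Console.WriteLine(match);
            }
        }
    }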
Upvotes: 15
Views: 8312
Reputation: 5837
I realize that this answer might be considered non-responsive and possibly annoying, but I would say that if you have an XML file which is 8GB, then at least some of what you are trying to do in XML should be done by the file system or a database.
If you have huge chunks of text in that file, you could store them as individual files and store the metadata and the filenames separately. If you don't, you must have many levels of structured data, probably with a lot of repetition of the structures. If you can decide what counts as an individual 'record' which can be stored as a smaller XML file or in a column of a database, then you can structure your database based on the levels of nesting above that.

XML is great for small, quick-and-dirty jobs, and it's also good for quite unstructured data, since it is self-describing. But if you have 8GB of data that you are going to do something meaningful with, you must (usually) be able to count on some predictable structure somewhere in it.
Storing XML (or JSON) in a database, and querying and searching both for XML records and within the XML itself, is well supported nowadays, both by SQL databases and by the NoSQL paradigm.
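For instance, here's a rough sketch of the database route, assuming SQL Server's xml column type (the Records table, the Doc column, and the record markup are made-up names for illustration):

    using System;
    using System.Data.SqlClient;

    class XmlColumnDemo
    {
        static void Main()
        {
            // Find records whose XML has id=42 and pull out one field,
            // letting the database do the searching instead of your code.
            const string sql =
                @"SELECT Doc.value('(/record/name)[1]', 'nvarchar(100)')
                  FROM Records
                  WHERE Doc.exist('/record[@id = 42]') = 1";

            using (var conn = new SqlConnection("Server=.;Database=Demo;Integrated Security=true"))
            using (var cmd = new SqlCommand(sql, conn))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                    while (reader.Read())
                        Console.WriteLine(reader.GetString(0));
            }
        }
    }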
Of course, you might not have a choice about using XML files this big, or they might really be the best solution for your situation. But for some people reading this, it could be helpful to look at this alternative.
Upvotes: 1
Reputation: 13574
Gabriel,
Dude, this isn't exactly answering your ACTUAL question (how to read big XML docs using LINQ), but you might want to check out my old question, What's the best way to parse big XML documents in C-Sharp. The last "answer" (timewise) was a "note to self" on what ACTUALLY WORKED. It turns out that a hybrid document-XmlReader & doclet-XmlSerializer approach is fast (enough) AND flexible; there's a sketch of it below.
BUT note that I was dealing with docs up to only 150MB. If you REALLY have to handle docs as big as 8GB, then I guess you're likely to encounter all sorts of problems, including issues with the O/S's large-file (>2GB) handling... in which case I strongly suggest you keep things as primitive as possible... and XmlReader is the most primitive (and, according to my testing, THE fastest) XML parser available in the Microsoft namespace.
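FWIW, here's a rough sketch of the hybrid (NOT my actual code; the record element and the Record class are placeholders for whatever your schema looks like):

    using System;
    using System.Collections.Generic;
    using System.Xml;
    using System.Xml.Serialization;

    public class Record
    {
        [XmlAttribute("id")] public int Id { get; set; }
        [XmlElement("name")] public string Name { get; set; }
    }

    public static class HybridParser
    {
        // XmlReader streams through the document; XmlSerializer turns each
        // small repeated element ("doclet") into a typed object. Only one
        // doclet is ever materialized at a time.
        public static IEnumerable<Record> Read(string path)
        {
            var serializer = new XmlSerializer(typeof(Record), new XmlRootAttribute("record"));
            using (var reader = XmlReader.Create(path))
            {
                while (reader.ReadToFollowing("record"))
                {
                    // ReadSubtree() hands the serializer just this one element.
                    using (var subtree = reader.ReadSubtree())
                        yield return (Record)serializer.Deserialize(subtree);
                }
            }
        }
    }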
Also: I've just noticed a belated comment in my old thread suggesting that I check out VTD-XML... I had a quick look at it just now... It "looks promising", even if the author seems to have contracted a terminal case of FIGJAM. He claims it'll handle docs of up to 256GB; to which I reply "Yeah, have you TESTED it? In WHAT environment?" It sounds like it should work, though... I've used this same technique to implement "hyperlinks" in a textual help system, back before HTML.
Anyway good luck with this, and your overall project. Cheers. Keith.
Upvotes: 8
Reputation: 25742
Using XElement.Load will load the whole file into memory. Instead, use XmlReader with the XNode.ReadFrom method, which lets you selectively load the nodes found by XmlReader as XElement objects for further processing, if you need to. MSDN has a very good example doing just that: http://msdn.microsoft.com/en-us/library/system.xml.linq.xnode.readfrom.aspx
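A condensed sketch of that pattern (huge.xml and the item element are placeholders for your document):

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Xml;
    using System.Xml.Linq;

    static class BigXml
    {
        // Yields each matching element one at a time; memory use stays
        // proportional to a single element, not the whole document.
        static IEnumerable<XElement> StreamElements(string path, string name)
        {
            using (XmlReader reader = XmlReader.Create(path))
            {
                reader.MoveToContent();
                while (!reader.EOF)
                {
                    if (reader.NodeType == XmlNodeType.Element && reader.Name == name)
                    {
                        // XNode.ReadFrom consumes the element and leaves the
                        // reader just past it, so don't call Read() here.
                        yield return (XElement)XNode.ReadFrom(reader);
                    }
                    else
                    {
                        reader.Read();
                    }
                }
            }
        }

        static void Main()
        {
            // Ordinary LINQ to XML over the streamed elements.
            var query = from el in StreamElements("huge.xml", "item")
                        where (string)el.Attribute("id") == "42"
                        select el;

            foreach (var el in query)
                Console.WriteLine(el);
        }
    }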
If you just need to search the XML document, XmlReader alone will suffice and will not load the whole document into memory.
Upvotes: 14