Reputation: 21
I have around 50,000 XML files, each about 50KB in size. I want to search for data in these files, but my approach so far is very slow. Is there any way to improve the search performance?
Upvotes: 2
Views: 1797
Reputation: 163468
Use an XML database. The usual recommendations are eXist if you want open source, MarkLogic if you want something commercial, but you could use SQL Server if being Microsoft matters to you and you don't want the ultimate in XML capability. And there are plenty of others if you want to evaluate them. All database products have a steep learning curve, but for these data volumes, it's the right solution.
Upvotes: 0
Reputation: 10250
You could spin up a Splunk instance and have it index your files. It's billed mostly as a log parser but would still serve your needs. It tokenizes files into words, indexes those words, and provides both a web-based and a CLI-based search tool that supports complex search criteria.
Upvotes: 0
Reputation: 13266
Assuming you are on Windows, you can use Windows Desktop Search to search the files quickly. It relies on the Windows index, which updates whenever a file changes. The SDK is available here and can be used from .NET.
Upvotes: 1
Reputation: 27282
A lot depends on the nature of these XML files. Are they just 50,000 XML files that won't be re-generated? Or are they constantly changing? Are there only certain elements within the XML files you want to index for searching?
Certainly opening 50k file handles, reading their contents, and searching for text is going to be very slow. I agree with Pavel: putting the data in a database will yield a big performance gain, but if your XML files change often, you will need some way to keep them synchronized with the database.
If you want to roll your own solution, I recommend scanning all the files and creating a word index. If your files change frequently, you will also want to track each file's "last modified" date, and if a file has changed more recently than your last scan, update your index. That way you end up with one ginormous word index, and if the search is for "foo", the index will reveal that the word can be found in file39209.xml, file57209.xml, and file01009.xml. Depending on the nature of the XML, you could even store the elements in the index file (which would, in essence, be like flattening all of your XML files into one).
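A minimal sketch of that idea in Python (the tokenization pattern and function names are illustrative assumptions, not a production design; note it does not purge stale entries when a file's contents change):

```python
import os
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_index(paths, index=None, mtimes=None):
    """Build or incrementally update an inverted word index.

    index maps word -> set of file paths containing it; mtimes remembers
    each file's last-modified time so unchanged files are skipped on rescans.
    """
    index = index if index is not None else defaultdict(set)
    mtimes = mtimes if mtimes is not None else {}
    for path in paths:
        mtime = os.path.getmtime(path)
        if mtimes.get(path) == mtime:
            continue  # unchanged since the last scan, skip re-reading it
        mtimes[path] = mtime
        # Flatten all element text in the XML file into one string
        text = " ".join(ET.parse(path).getroot().itertext())
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(path)
    return index, mtimes

def search(index, word):
    """Look up a single word; returns the files it appears in."""
    return sorted(index.get(word.lower(), set()))
```

A search then becomes a single dictionary lookup instead of 50k file reads; only files whose modification time changed are re-scanned.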
Upvotes: 0
Reputation: 2429
You could use Lucene.NET, a lightweight, fast, flat-file search indexing engine.
See http://codeclimber.net.nz/archive/2009/09/02/lucene.net-your-first-application.aspx for a getting started tutorial.
Upvotes: 6
Reputation: 6293
You can always index the content of the files into a database and perform the search there. Databases are quite fast at this kind of search.
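As one hedged sketch of this approach, SQLite's built-in FTS5 full-text extension can hold the extracted text without a separate database server; the table and column names here are illustrative assumptions:

```python
import sqlite3
import xml.etree.ElementTree as ET

def index_files(conn, paths):
    """Extract each XML file's text and store it in an FTS5 table."""
    # 'path' is stored but not tokenized; 'body' is full-text indexed
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(path UNINDEXED, body)"
    )
    for path in paths:
        body = " ".join(ET.parse(path).getroot().itertext())
        conn.execute("INSERT INTO docs (path, body) VALUES (?, ?)", (path, body))
    conn.commit()

def query(conn, terms):
    """Return the paths of files matching an FTS5 query string."""
    # MATCH consults the full-text index instead of scanning every row
    rows = conn.execute("SELECT path FROM docs WHERE docs MATCH ?", (terms,))
    return [r[0] for r in rows]
```

The search then runs against the index rather than reopening 50k files; keeping the table in sync with changing files would still need a rescan strategy like the one described above.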
Upvotes: 1