Lilienthal

Reputation: 4378

Analysing (tens of) thousands of XML files

I have an application that will generate a log entry for every search query it processes (from Solr) so that I can calculate certain statistics for my search engine, for instance the number of queries without results, the average number of hits, et cetera. Now I'm wondering how best to perform this analysis. Load is estimated to peak at tens of thousands of searches per day, with statistics generated over a weekly period. In other words, I'm looking for the best way to calculate statistics over up to one hundred thousand XML files.
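
To make that concrete, here is the sort of XQuery I have in mind, purely as a sketch: the <search> and <numFound> element names are placeholders (the schema isn't fixed yet), and the directory-style collection URI with ?select is the form Saxon accepts; other processors resolve collection() differently.

    (: Sketch only, not the real schema: each log file is assumed to have a
       <search> root with a <numFound> child holding the hit count. :)
    let $logs := collection("file:///var/log/search?select=*.xml")/search
    let $hits := for $log in $logs return xs:integer($log/numFound)
    return
      <stats>
        <totalQueries>{ count($logs) }</totalQueries>
        <queriesWithoutResults>{ count($hits[. = 0]) }</queriesWithoutResults>
        <averageHits>{ avg($hits) }</averageHits>
      </stats>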

The process will be controlled by Apache Camel and I'm currently thinking that XQuery will be my best bet for tackling this problem. As I'm still working on establishing the schema, I can't run any real-world tests, so I wanted to garner some opinions on the best approach to take before I dive in. Some questions:

  • Can XQuery handle this many files at once, using collections?
  • Is XQuery the right tool for the job here?
  • Would it be worth it to index the files, or is that unnecessary?
  • Should I simply store the files on the file system, or put them somewhere else, such as a database?

Upvotes: 0

Views: 561

Answers (3)

Alexandre Rafalovitch

Reputation: 9789

Do they have to be in XML format? I would very strongly explore loading these statistics into a database of some sort: either a normal database if the fields/categories of information are regular, or one of the schema-less NoSQL databases if they are not. This would make deriving statistics much easier.

You can even load it back into Solr (a separate core) using either a concrete schema or dynamic fields, if your logged criteria may change.

Upvotes: 0

Michael Kay

Reputation: 163468

Either XSLT 2.0 or XQuery 1.0 can handle this in principle, but the performance depends on the actual volumes and on the complexity of the queries. Generally, (I know it sounds banal) XSLT is better at transformation (generating a new document from each source document) while XQuery is better at query (extracting a small amount of information from each source document). There's no particular point in merging all the small documents into one big document. I would say also that there's not much point in putting them in a database unless either (a) you really need the cross-indexing this will provide, or (b) you're going to use the documents repeatedly over a period of time.
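
To illustrate the query-style workload, here is a small XQuery sketch that extracts one record from each log document rather than transforming it; the element names <search>, <q> and <numFound> are placeholders for whatever the final schema uses, and the collection URI is Saxon's directory convention.

    (: Query-style extraction: pull a small record out of every log document
       instead of producing a new document from each one. :)
    for $log in collection("file:///var/log/search?select=*.xml")/search
    where xs:integer($log/numFound) = 0
    return <noResultQuery>{ string($log/q) }</noResultQuery>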

Upvotes: 3

dirkk

Reputation: 6218

Answers in the respective order of the questions:

  • Yes, XQuery can handle an indefinite number of files using collections; take a look at the fn:collection() function (see the sketch after this list)
  • The "right tool" is a highly subjective question and debatable, therefore it does not really some to fit to SO. However, if you want to work with XML documents, XQuery is an obvious choice as it is exactly designed for that. But of course this also depends on other factors, e.g. your skill set
  • Surely an index will speed up the job. Whether it is really necessary depends on a number of factors, e.g. the size of the files and the expected workload. It is very hard to give a definite answer here, but as a general rule indexing is a good idea. However, if you update very often, it might be costly to maintain the index. It is hard to tell whether your application will benefit from it, as it depends on the workload, the expected number of reads and writes, and many more factors
  • I would very much not recommend just storing them on a file system. You asked earlier about indexing them in Apache Lucene/Solr, so why not index them using an XML database? If you have a hundred thousand XML files and store them just on the file system, processing them will quite likely be awfully slow. It sounds very much like a job for an XML database. There are different ones out there, like MarkLogic (commercial), eXist (open source) or BaseX (open source), to name a few.
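
As a sketch for the first point: fn:collection() works over a plain directory (e.g. Saxon's file:///...?select=*.xml form) as well as over an XML database such as BaseX, where the argument is simply a database name. The database name "searchlogs" and the element names below are made up for illustration.

    (: In BaseX, collection("searchlogs") returns every document stored in a
       database named "searchlogs"; with an index in place, a lookup like the
       one below does not have to scan every file. :)
    let $logs := collection("searchlogs")/search
    return
      <weekly>
        <total>{ count($logs) }</total>
        <noResults>{ count($logs[xs:integer(numFound) = 0]) }</noResults>
      </weekly>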

Upvotes: 2
