Reputation:
What I'm asking for is a methodology rather than a concrete solution. I'll start by describing the situation I found challenging, and will then proceed to the question; I hope it makes more sense to do it this way.
I'm dealing with data extracted from natural language. This data must later be analysed against a sort of "knowledge base" (quoted because it's not really a knowledge base; I'll get to that later). The knowledge base is large: its volume, so far only in theory but soon in practice, will surpass what can be stored in memory. My two concerns are these:
Moving the data over to a database server will mean a speed reduction by a factor of... well, I don't know what factor, but it could easily be several orders of magnitude. I.e. finding a piece of data in an object native to the runtime, located in memory, is significantly faster than querying the database.
The entire huge volume of data is not required at any one time. In fact, only a very small fraction is used, so perhaps some caching could help. I'm actually hoping that someone has already faced this problem and caching was the right answer.
The "knowledge base" is so far just a complex data structure, which can be queried about in a similar way you would query a database by using some query language. I.e. it is not a simple lookup value by key operation, it requires multiple sub-queries to identify an object as matching the given criteria.
Just to give you a more concrete example of what I'm trying to do: unlike langutils, I'm trying to come up with a parser, which I'm calling a "predictive parser" (sorry if the term is already taken and means something else :)). The main idea is that instead of assigning POS tags to words and then iteratively correcting the original assumption by applying a set of rules to the inferred information, I'm trying to do it in such a way that, given a certain prefix, the engine generates a continuation based on its "learned knowledge". I.e. suppose the knowledge base has learned that the prefix "I could " is almost certainly followed by a verb phrase. The parser would then assume a verb phrase and parse it as such, unless it hits an error. The difficult part is finding the proper prefix. The bad thing is that prefixes like "I will " and "Thou shalt " would get equal priority, i.e. they would be checked for a match in the same order, whether random, alphabetical, etc. The idea, though, is that during knowledge acquisition the knowledge base would learn to store and look up the information in such a way that the most likely prefixes are looked up first, and the least likely prefixes are not even loaded initially.
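To make the lookup part concrete, here's a minimal sketch in Python of the kind of structure I have in mind (the names `PrefixTrie` and `PrefixNode` are mine, purely illustrative, not from any library): a trie keyed on words, where each node remembers how often each continuation category followed that prefix, so the most likely continuations come back first:

```python
from collections import defaultdict

class PrefixNode:
    """One trie node: children keyed by the next word, plus counts of
    the continuation categories observed after this prefix."""
    __slots__ = ("children", "continuations")

    def __init__(self):
        self.children = {}
        self.continuations = defaultdict(int)  # e.g. {"VP": 412, "NP": 3}

class PrefixTrie:
    def __init__(self):
        self.root = PrefixNode()

    def learn(self, prefix_words, continuation_tag):
        """Record that `prefix_words` was once followed by `continuation_tag`."""
        node = self.root
        for word in prefix_words:
            node = node.children.setdefault(word, PrefixNode())
        node.continuations[continuation_tag] += 1

    def predict(self, prefix_words):
        """Walk down as far as the prefix matches and return the known
        continuation tags, most frequent first."""
        node = self.root
        for word in prefix_words:
            if word not in node.children:
                break
            node = node.children[word]
        return sorted(node.continuations,
                      key=node.continuations.get, reverse=True)

trie = PrefixTrie()
trie.learn(["I", "could"], "VP")
trie.learn(["I", "could"], "VP")
trie.learn(["I", "could"], "NP")
print(trie.predict(["I", "could"]))  # ['VP', 'NP']
```

What this sketch doesn't do is the part I'm asking about: keeping only the hot subtrees in memory and leaving the rest on disk.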
The concept is somewhat similar to how a CPU cache works. So, if what I wrote above is too long: I'm looking for a data structure that functions like a CPU cache, where what's currently cached resides in memory, and what isn't is stored in a database, a file, etc.
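Something along these lines, sketched here with the standard library's `shelve` as the backing store (this assumes string keys and picklable values; the class name and the capacity are made up):

```python
import shelve
from collections import OrderedDict

class DiskBackedCache:
    """Keep the `capacity` most recently used entries in memory;
    evicted entries are written out to an on-disk shelve (dbm) file."""

    def __init__(self, path, capacity=100_000):
        self.disk = shelve.open(path)        # keys must be strings
        self.mem = OrderedDict()
        self.capacity = capacity

    def __setitem__(self, key, value):
        self.mem[key] = value
        self.mem.move_to_end(key)            # mark as most recently used
        if len(self.mem) > self.capacity:    # evict the least recently used
            old_key, old_value = self.mem.popitem(last=False)
            self.disk[old_key] = old_value

    def __getitem__(self, key):
        if key in self.mem:                  # fast path: in-memory hit
            self.mem.move_to_end(key)
            return self.mem[key]
        value = self.disk[key]               # slow path: read from disk...
        self[key] = value                    # ...and promote into memory
        return value

    def close(self):
        for key, value in self.mem.items():  # flush the hot set on shutdown
            self.disk[key] = value
        self.disk.close()
```

The open question for me is what eviction policy to use: plain LRU, as above, or something learned during knowledge acquisition.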
PS. Sorry for my collection of tags; I feel like it doesn't really describe my question. You are welcome to adjust it if you know where the question belongs.
Upvotes: 4
Views: 285
Reputation: 9451
If we just consider this part:
The idea, though, is that during knowledge acquisition the knowledge base would learn to store and look up the information in such a way that the most likely prefixes are looked up first, and the least likely prefixes are not even loaded initially.
then, if I understood you correctly, you're dealing with the task of handling n-grams. Since in your situation you're not putting any explicit limit on the prefixes, it can be assumed that the generally reasonable limits apply, which are 4-5-word n-grams. There are a lot of such n-grams: from a real-world corpus you'd easily get gigabytes of data. But even if you limit yourself to 3-grams only, you'll still get at least a couple of gigabytes, unless you perform some clever pre-processing that somehow separates out the "good" n-grams. (Coupled with proper smoothing, this may be a feasible solution.)
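To see where the volume comes from, here's a rough sketch of the extraction step (plain 3-gram counting over a token stream; the toy corpus is made up):

```python
from collections import Counter

def count_trigrams(tokens):
    """Count every 3-gram in a token stream."""
    counts = Counter()
    for i in range(len(tokens) - 2):
        counts[tuple(tokens[i:i + 3])] += 1
    return counts

tokens = "i could go there if i could find the way".split()
print(count_trigrams(tokens).most_common(3))
# e.g. [(('i', 'could', 'go'), 1), (('could', 'go', 'there'), 1), ...]
```

With a vocabulary in the tens of thousands, even the tiny fraction of 3-word combinations that actually occur adds up to the gigabytes mentioned above.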
The bad news about n-grams, besides their size, is that they are distributed according to Zipf's law, which basically means that caching won't be very useful: a small head of very frequent n-grams is easy to keep in memory, but queries will keep hitting the long tail of rare ones.
So I'd just put the data into some fast database on the local machine (maybe some variant of dbm). If you can fit it all in memory, then Memcached or Redis might be faster.
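For the dbm variant, a minimal sketch using the standard-library `dbm` module (the file name and the value encoding are arbitrary; dbm stores raw bytes, so you serialize however you like):

```python
import dbm

# At knowledge-acquisition time: write prefix -> continuation counts.
# The "VP:412,NP:3" encoding is just an example.
with dbm.open("ngrams.db", "c") as db:
    db["i could"] = "VP:412,NP:3"

# At parse time: look up one prefix without loading the whole table.
with dbm.open("ngrams.db", "r") as db:
    entry = db.get(b"i could")
    if entry is not None:
        print(entry.decode())  # -> VP:412,NP:3
```

The same key/value layout carries over directly to Memcached or Redis if you do end up fitting everything in memory.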
Upvotes: 1