Reputation:
What I'm asking for is a methodology rather than a concrete solution. I'll start by describing the situation I found challenging, and will then proceed to the question; I hope it makes more sense to do it this way.
I'm dealing with data extracted from natural language. This data must later be analysed against a sort of "knowledge base" (quoted because it's not really a knowledge base; I'll get to that later). The knowledge base is large: its volume, so far only in theory but soon in practice, will surpass what can be stored in memory. My two concerns are these:
Moving the data over to a database server will mean a speed reduction by a factor of... well, I don't know what factor, but it could easily be several orders of magnitude. I.e. finding a piece of data in an object native to the runtime, located in memory, is significantly faster than querying the database.
The entire huge volume of data is not required at any one time. In fact, only a very small fraction is used, so perhaps some caching could help. I'm actually hoping that someone has already faced this problem and caching was the right answer.
The "knowledge base" is so far just a complex data structure, which can be queried about in a similar way you would query a database by using some query language. I.e. it is not a simple lookup value by key operation, it requires multiple sub-queries to identify an object as matching the given criteria.
Just to give you a more concrete example of what I'm trying to do: unlike langutils, I'm trying to come up with a parser, which I'm calling a "predictive parser" (sorry if the term is already taken and means something else :)). The main idea is that instead of assigning POS tags to words and then iteratively correcting the original assumption by applying a set of rules to the inferred information, I'm trying to do it in such a way that, given a certain prefix, the engine generates a continuation based on its "learned knowledge". I.e. suppose the knowledge base has learned that the prefix "I could " is almost certainly followed by a verb phrase. The parser would then assume a verb phrase and parse it as such, unless it hits an error. The difficult part is finding the proper prefix. The bad thing is that prefixes like "I will " and "Thou shalt " would get equal priority, i.e. they would be checked for a match in the same order, whether random, alphabetical, etc. The idea, though, is that during knowledge acquisition the knowledge base would learn to store and look up the information in such a way that the most likely prefixes are looked up first, and the least likely prefixes are not even loaded initially.
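To make the lookup part concrete, here's a minimal sketch in Python of the kind of structure I have in mind (the names `PrefixTrie` and `PrefixNode` are mine, purely illustrative, not from any library): a trie keyed on words, where each node remembers how often each continuation category followed that prefix, so the most likely continuations come back first:

```python
from collections import defaultdict

class PrefixNode:
    """One trie node: children keyed by the next word, plus counts of
    the continuation categories observed after this prefix."""
    __slots__ = ("children", "continuations")

    def __init__(self):
        self.children = {}
        self.continuations = defaultdict(int)  # e.g. {"VP": 412, "NP": 3}

class PrefixTrie:
    def __init__(self):
        self.root = PrefixNode()

    def learn(self, prefix_words, continuation_tag):
        """Record that `prefix_words` was once followed by `continuation_tag`."""
        node = self.root
        for word in prefix_words:
            node = node.children.setdefault(word, PrefixNode())
        node.continuations[continuation_tag] += 1

    def predict(self, prefix_words):
        """Walk down as far as the prefix matches and return the known
        continuation tags, most frequent first."""
        node = self.root
        for word in prefix_words:
            if word not in node.children:
                break
            node = node.children[word]
        return sorted(node.continuations,
                      key=node.continuations.get, reverse=True)

trie = PrefixTrie()
trie.learn(["I", "could"], "VP")
trie.learn(["I", "could"], "VP")
trie.learn(["I", "could"], "NP")
print(trie.predict(["I", "could"]))  # ['VP', 'NP']
```

What this sketch doesn't do is the part I'm asking about: keeping only the hot subtrees in memory and leaving the rest on disk.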
The concept is somewhat similar to how a CPU cache works. So, if what I wrote above is too long: I'm looking for a data structure that functions like a CPU cache, where what's currently cached resides in memory, and what isn't is stored in a database, a file, etc.
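Something along these lines, sketched here with the standard library's `shelve` as the backing store (this assumes string keys and picklable values; the class name and the capacity are made up):

```python
import shelve
from collections import OrderedDict

class DiskBackedCache:
    """Keep the `capacity` most recently used entries in memory;
    evicted entries are written out to an on-disk shelve (dbm) file."""

    def __init__(self, path, capacity=100_000):
        self.disk = shelve.open(path)        # keys must be strings
        self.mem = OrderedDict()
        self.capacity = capacity

    def __setitem__(self, key, value):
        self.mem[key] = value
        self.mem.move_to_end(key)            # mark as most recently used
        if len(self.mem) > self.capacity:    # evict the least recently used
            old_key, old_value = self.mem.popitem(last=False)
            self.disk[old_key] = old_value

    def __getitem__(self, key):
        if key in self.mem:                  # fast path: in-memory hit
            self.mem.move_to_end(key)
            return self.mem[key]
        value = self.disk[key]               # slow path: read from disk...
        self[key] = value                    # ...and promote into memory
        return value

    def close(self):
        for key, value in self.mem.items():  # flush the hot set on shutdown
            self.disk[key] = value
        self.disk.close()
```

The open question for me is what eviction policy to use: plain LRU, as above, or something learned during knowledge acquisition.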
PS. Sorry for my collection of tags; I feel like it doesn't really describe my question. You are welcome to adjust it if you know where the question belongs.
Upvotes: 4
Views: 285
Reputation: 9451
If we just consider this part:
The idea, though, is that during knowledge acquisition the knowledge base would learn to store and look up the information in such a way that the most likely prefixes are looked up first, and the least likely prefixes are not even loaded initially.
then, if I understood you correctly, you're dealing with the task of handling n-grams. Since in your situation you're not putting any explicit limit on the prefixes, it can be assumed that the generally reasonable limits apply, which are 4-5-word n-grams. There are a lot of such n-grams: from a real-world corpus you'd easily get gigabytes of data. But even if you limit yourself to 3-grams only, you'll still get at least a couple of gigabytes, unless you perform some clever pre-processing that somehow separates out the "good" n-grams. (Coupled with proper smoothing, this may be a feasible solution.)
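To see where the volume comes from, here's a rough sketch of the extraction step (plain 3-gram counting over a token stream; the toy corpus is made up):

```python
from collections import Counter

def count_trigrams(tokens):
    """Count every 3-gram in a token stream."""
    counts = Counter()
    for i in range(len(tokens) - 2):
        counts[tuple(tokens[i:i + 3])] += 1
    return counts

tokens = "i could go there if i could find the way".split()
print(count_trigrams(tokens).most_common(3))
# e.g. [(('i', 'could', 'go'), 1), (('could', 'go', 'there'), 1), ...]
```

With a vocabulary in the tens of thousands, even the tiny fraction of 3-word combinations that actually occur adds up to the gigabytes mentioned above.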
The bad news about n-grams, besides their size, is that they are distributed according to Zipf's law, which basically means that caching won't be very useful: a small head of very frequent n-grams is easy to keep in memory, but queries will keep hitting the long tail of rare ones.
So I'd just put the data into some fast database on the local machine (maybe some variant of dbm). If you can fit it all in memory, then Memcached or Redis might be faster.
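For the dbm variant, a minimal sketch using the standard-library `dbm` module (the file name and the value encoding are arbitrary; dbm stores raw bytes, so you serialize however you like):

```python
import dbm

# At knowledge-acquisition time: write prefix -> continuation counts.
# The "VP:412,NP:3" encoding is just an example.
with dbm.open("ngrams.db", "c") as db:
    db["i could"] = "VP:412,NP:3"

# At parse time: look up one prefix without loading the whole table.
with dbm.open("ngrams.db", "r") as db:
    entry = db.get(b"i could")
    if entry is not None:
        print(entry.decode())  # -> VP:412,NP:3
```

The same key/value layout carries over directly to Memcached or Redis if you do end up fitting everything in memory.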
Upvotes: 1