coffee dude
coffee dude

Reputation: 1

How to quickly compare one set of booleans to many other sets of booleans (order matters)?

I'm running into a problem with a project I'm working on in my spare time. I'm using Google App Engine (Java version), but this question is not specific to that platform, and I would consider other languages/platforms if they could solve the problem.

The following illustrates the problem:

Suppose I have a datastore with thousands of recipes, and the ingredients for each recipe. (For the sake of this illustration, forget about measurements.) I want to be able to enter a list of ingredients that I have on hand, and then quickly retrieve all recipes for which I have at least XX% of the ingredients (let's say 75%). I'm willing to sacrifice some accuracy and some results for speed, but do want a certain degree of accuracy. I can do a more thorough comparison after I get the "quick results."

My attempt at a solution: Analyzing the database of recipes, I compile a list of, say, 200 common food ingredients (eggs, flour, salt, sugar, rosemary, etc). Almost all the ingredients for the recipes are contained within this master list:

Common Food Ingredients: [ eggs , flour , salt , sugar , cinnamon ... ]

Then, I go through each individual recipe and compare the ingredients to this master list, and end up with a set of 200 booleans for each recipe:

Recipe #106: [ T , T , F , T , F ... ]
Recipe #107: [ F , T , T , T , F ... ]

I would store this information with the recipes. (Up to this point, it's all data prep work, which I have all the time in the world to do.)

Now, I enter my list of ingredients on hand. I would do the same comparison with the master list:

My ingredients on hand: [ F , F , T , T , F ... ]

And this is where I'm stuck. How can I quickly compare this set of booleans against the sets for the recipes so I can identify recipes for which I have at least 75% of the ingredients?

Or (and this would be the holy grail), during the data preparation, instead of storing the set of booleans themselves with each recipe, is there a calculation I can perform that will give me a single value I can later filter off of? (E.g., "SELECT * FROM recipes WHERE master_list_boolean_metric <= 29")

Or am I going about this the wrong way? (Any guidance, general or specific, would be appreciated.) What I want to avoid is doing a slow comparison, ingredient by ingredient, between each recipe and my list of "on-hand" ingredients.

Or... perhaps it isn't possible to do this quickly?

Upvotes: 0

Views: 75

Answers (1)

Bwmat
Bwmat

Reputation: 4663

use BitSet.

store each ingredient as one bit, do an AND with the ingredients you have, and then filter on cardinality()

Upvotes: 1

Related Questions