Abdull
Abdull

Reputation: 27872

Java: How to filter a big collection of objects with a big collection of predicates?

In Java, I have a big collection of objects (~ 10,000 objects), say Set<Person> cityInhabitants. I also have a big collection of predicates (~ 1,000 predicates) which shall be used to filter away any Person matching any of these predicates. Predicates could be for example

This requirement calls for the following challenges:

What are solutions to these challenges? Are there any libraries that can help here?

Upvotes: 6

Views: 1997

Answers (4)

omerkudat
omerkudat

Reputation: 10111

(I hadn't realized this question was 2 years old. I'm so late to this party! It would be good to know what solution the author ended up using.)

Are there any libraries that can help here? Well, sure there are!

Your data collection isn't very large, but you have a disproportionately large number of predicates. Plus, you want these predicates to be managed by your users, and stored centrally etc. This sounds like a good fit for Drools, which is a rules engine, and comes with additional tooling to author, validate and store such rules.

But Drools can be large and involved. Perhaps you need something simpler? Your code sample, and your first requirement of speed, made me think of CQEngine, which is a library for indexing objects. It indexes fields (such as your 'name field) and it can search these fields in various ways (equals, starts with, contains, etc). It is fast and it is simple, but it can only index. You would yourself have to come up with rule definitions etc. On the other hand, CQEngine supports logical predicates, so you can chain your predicates together with and/or.

And there are other libraries for rules engines or object indexing. I'm sure other people will list them in their answers.

Upvotes: 1

mbmast
mbmast

Reputation: 1110

Here's an alternative: Identify all possible attributes that an instance of a class might have. In your example, you have a person class with two attributes; name and age. Because you have getters for these attributes, it's likely that at most, a person can have two attributes (unless there are other getters that you didn't mention). You could have implemented person such that the attributes are held in a collection, so that you really have no limit on the number of attributes. Regardless of how it is implemented, identify all the attributes.

Now for each attribute, associate a unique prime number and then for each instance of person maintain the product of those prime numbers corresponding to those attributes assigned to that person. For example, assume a person can be young or old, male or female, good looking or bad looking. That's 6 attributes and let's assign prime numbers as follows:

02: young
03: old
05: male
07: female
11: good looking
13: bad looking

Continuing the example, assume a person is a good looking, young female. The product of the prime numbers would be 2 X 7 X 11, or 154.

Now you want to find all good looking young people, regardless of gender. The product of primes associated with this predicate is 2 X 11, or 22.

So you can now iterate through all your people and if the product of primes associated with each people can be divided by 22 without any remainder (it can in the case where the person with a product of primes is 154), then you have a match.

You might want to use the BigNumber class to perform the multiplication, division and the storing of the product of primes.

This solution is very fast if you are given a person and asked if it matches all the predicates (again, the predicates have been reduced to unique prime numbers and the collection of predicates is now represented by the product of those prime numbers).

This solution may not be so fast if you have to iterate over your entire collection of people looking for a match.

Upvotes: 1

Tom
Tom

Reputation: 16208

I would suggest you sort the predicates in order of speed of execution. You can then execute your predicates in order of speed, using the fastest ones first, generally meaning the slower predicates will have to run over a smaller set.

However this assumption is not completely correct, you would need to work out the percentage of predicates removed to speed of execution. Then we can see which is the fastest predicate that removes the highest percentage of objects. We can then execute the predicates in this order to me most optimal.

You can easily implement your own predicate interface

public interface Predicate<T> {

    boolean filter(T object);

}

You would then need to create predicate object for each of the rules. You can make some more dynamic classes for age and name checking which will reduce the amount of code you will need also.

public class AgeCheck<T> implements Predicate<T> {

    private final int min;
    private final int max;
    public AgeCheck(int min, int max) {
        this.min = min;
        this.max = max;
    }

    @Override
    public boolean filter(T object) {
        // if( t.age() < max && t.age > min) ...
    }

}

Upvotes: 2

Jack
Jack

Reputation: 133669

In this situation there is not much you can do in relation to the complexity of the operation itself. If entries are many, predicates are many and predicates are expensive then you can optimize to be fast as you can but you won't surely get over a certain threshold because the single operation here maybe expensive.

You should test different approaches and see whatever performs better in your specific situation:

  • sort predicates according by first checking the ones that should be wider (so that the first predicates will filter out as many entries as possible)
  • sort predicates according to their complexity (so that faster will be executed first and the slower on less entries)
  • don't update the original data structure but keep a parallel set that will contain the filtered elements vs
  • always update the data structure so that you will loop over smaller amount of people everytime

Upvotes: 2

Related Questions