Reputation: 8922
I have a large collection of objects of type foo
. Each object of type foo
has say 100 properties (all strings) plus an id. An object of type bar
also has these 100 properties.
I want to find the matching object of type foo from the collection where all these properties match with that of bar.
Aside from the brute force method, is there an elegant algorithm where we can calculate a signature for foo
objects once and do the same for the bar
object and match more efficiently?
The foo
s are in the thousands and the bar
s are in the millions.
Upvotes: 0
Views: 242
Reputation: 1769
Hash is very nice and simple to implement.. But i want suggest you that algorithm:
So... Complicity of algorithm is - O(Sum(|Ai|) + Sum(|Bi|)) = O(max(Sum(|Ai|), Sum(|Bi|)) = O(Sum(|Bi|)) for your problem. Ai - string unique id for first set, Bi - string unique id for second set.
UPDATE: Trie takes O(Sum(|Ai|) * |Alphabet|) space at worst.
Upvotes: 0
Reputation: 40345
Darth Vader has a point there... and I never thought that I'd be siding with the dark side!
I'll go over what I think are the best tools for the trade:
The Embedded Database
The goal of using an embedded database is that you will get performance that will beat most database solutions that you're likely to encounter. We can talk about just how fast LevelDB is, but plenty of other people have already talked about it quite a bit so I won't waste time. The embedded database allows you to store key/value pairs and quickly find them in your database.
The Hashing Function
A good hashing function will be fast and it will provide a good distribution of non-repeatable hashes. CityHash is very fast and it has very good distribution, but again: I won't waste time since a lot of other people have already talked about the performance of CityHash. You would use the hashing function to hash your objects and then use the unique key to look them up in the database.
JSON Serialization
JSON Serialization is the antithesis of what I've shown above: it's very slow and it will diminish any performance gain you achieved with CityHash, but it gives you a very simple way to hash an entire object. You serialize the object to a JSON string, then you hash the string using CityHash. Despite the fact that you've lost the performance gains of CityHash because you spent so much time serializing the object to JSON, you will still reap the benefits of having a really good hashing function.
The Conclusion
Enjoy!
Upvotes: 2
Reputation: 55022
If you have ALL matching properties. That means they are same objects actually. is that correct?
In any case, you want to use a Map/Dictionary/Table with a good hashing algorithm to find matching objects.
Whichever language you are using, you should override the gethashcode and equals methods to implement it.
If you have a good hashing algorithm your access time will be O(1). otherwise it can be upto O(n).
Based on your memory limitation, you want to store foos in the map, storing bars might requite lots of space which you might not have.
Upvotes: 2