Barka
Barka

Reputation: 8922

How to match against a large collection efficiently

I have a large collection of objects of type foo. Each object of type foo has say 100 properties (all strings) plus an id. An object of type bar also has these 100 properties.

I want to find the matching object of type foo from the collection where all these properties match with that of bar.

Aside from the brute force method, is there an elegant algorithm where we can calculate a signature for foo objects once and do the same for the bar object and match more efficiently?

The foos are in the thousands and the bars are in the millions.

Upvotes: 0

Views: 242

Answers (3)

Ivan Bianko
Ivan Bianko

Reputation: 1769

Hash is very nice and simple to implement.. But i want suggest you that algorithm:

  1. Map your 100 string properties to one big string(for example concatenate with fixed length for each property) that should unique id of this object. So we have 1000 string in first set, and 1mln strings in second.
  2. The problem reduces to find for each strings in second set if first set contains it.
  3. Make trie data structure on first set
  4. Complicity of checking if string S in the trie is O(|S|). |S| - length of S.

So... Complicity of algorithm is - O(Sum(|Ai|) + Sum(|Bi|)) = O(max(Sum(|Ai|), Sum(|Bi|)) = O(Sum(|Bi|)) for your problem. Ai - string unique id for first set, Bi - string unique id for second set.

UPDATE: Trie takes O(Sum(|Ai|) * |Alphabet|) space at worst.

Upvotes: 0

Kiril
Kiril

Reputation: 40345

Darth Vader has a point there... and I never thought that I'd be siding with the dark side!

I'll go over what I think are the best tools for the trade:

The Embedded Database

The goal of using an embedded database is that you will get performance that will beat most database solutions that you're likely to encounter. We can talk about just how fast LevelDB is, but plenty of other people have already talked about it quite a bit so I won't waste time. The embedded database allows you to store key/value pairs and quickly find them in your database.

The Hashing Function

A good hashing function will be fast and it will provide a good distribution of non-repeatable hashes. CityHash is very fast and it has very good distribution, but again: I won't waste time since a lot of other people have already talked about the performance of CityHash. You would use the hashing function to hash your objects and then use the unique key to look them up in the database.

JSON Serialization

JSON Serialization is the antithesis of what I've shown above: it's very slow and it will diminish any performance gain you achieved with CityHash, but it gives you a very simple way to hash an entire object. You serialize the object to a JSON string, then you hash the string using CityHash. Despite the fact that you've lost the performance gains of CityHash because you spent so much time serializing the object to JSON, you will still reap the benefits of having a really good hashing function.

The Conclusion

  • You can store billions of records in LevelDB and you will be able to quickly retrieve the exact value you're looking for just by providing the hash for it.
  • In order to generate a key, you can use JSON serialization and CityHash to hash the JSON string.
  • Use the key to find the matching object!

Enjoy!

Upvotes: 2

DarthVader
DarthVader

Reputation: 55022

If you have ALL matching properties. That means they are same objects actually. is that correct?

In any case, you want to use a Map/Dictionary/Table with a good hashing algorithm to find matching objects.

Whichever language you are using, you should override the gethashcode and equals methods to implement it.

If you have a good hashing algorithm your access time will be O(1). otherwise it can be upto O(n).

Based on your memory limitation, you want to store foos in the map, storing bars might requite lots of space which you might not have.

Upvotes: 2

Related Questions