Reputation: 10385
In a project I'm currently working on there are about 200,000 users. For each of these users we have defined a similarity measure with regard to every other user. This yields a 200,000 x 200,000 similarity matrix. A tad large. A naive approach (in Ruby) of calculating each entry would take days.
What strategies can I employ to make computing the matrix entries feasible? And in what data store should I put this beast?
Upvotes: 1
Views: 2792
Reputation: 136359
Storing the matrix, and especially calculating anything based on it, is a nightmare. Your similarity measure most likely uses floats (4 bytes), which means the uncompressed storage size is 200000**2 * 4 bytes = 160 GB.
There are four conceptual solutions to this problem.
Data Compression: Exploit structure in the matrix. It is presumably symmetric, so you only need to store half of it, and if many entries are zero a sparse representation avoids storing them at all.
Data Reduction: You can cluster your users and then build the similarity matrix between clusters. If your clusters contain 200 users each, you only have a 1000x1000 matrix and thus only need 4 MB to store it. This may bring other benefits such as speed and robustness as well (see the sketch after this list).
Vertical Scaling: Use a bigger machine. Amazon has one with 2 TB of memory for as little as 3970 USD ;-)
Horizontal Scaling: Build block matrices, i.e. chunks of the big matrix that are small enough to handle, and process them separately (possibly on several machines).
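To make the data-reduction option concrete, here is a minimal sketch (Python with scikit-learn; the random feature matrix, the choice of cosine similarity, and the cluster count are assumptions, since the question doesn't say what the similarity is based on):

```python
# Sketch: reduce 200,000 users to 1,000 cluster centroids and build the
# similarity matrix between centroids instead of between individual users.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X = rng.random((200_000, 32)).astype(np.float32)   # placeholder user feature vectors

k = 1000                                           # ~200 users per cluster
km = MiniBatchKMeans(n_clusters=k, batch_size=10_000, random_state=0).fit(X)

# 1000 x 1000 cluster-level matrix: ~4 MB in float32 instead of 160 GB
cluster_sim = cosine_similarity(km.cluster_centers_).astype(np.float32)

# Approximate user-to-user similarity via the users' cluster labels
labels = km.labels_
approx_sim_user0_user1 = cluster_sim[labels[0], labels[1]]
```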
Upvotes: 0
Reputation: 78316
Here are some bits and pieces of an answer; there are still too many gaps in what you've told us to permit a good answer, but you can fill those in yourself. From everything you've told us, I don't think that the major part of your task is to efficiently calculate a large similarity matrix; I think the major parts are to efficiently retrieve values from such a matrix and to efficiently update it.
As we've already determined, the matrix is sparse and symmetric; it would be useful to know how sparse. This reduces the storage requirements considerably, but we don't know by how much.
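As an illustration of how much those two properties buy you, here is a minimal sketch (Python/scipy; the dictionary-of-keys format is just one of several sparse representations you could pick):

```python
# Sketch: store only the nonzero entries of the upper triangle; symmetry
# lets us answer queries for either (i, j) order from the same entry.
import numpy as np
from scipy import sparse

n = 200_000
sim = sparse.dok_matrix((n, n), dtype=np.float32)   # cheap while the matrix stays sparse

def set_sim(i, j, value):
    i, j = min(i, j), max(i, j)    # canonical order: keep the upper triangle only
    if value != 0.0:               # skip zero entries entirely
        sim[i, j] = value

def get_sim(i, j):
    i, j = min(i, j), max(i, j)
    return sim[i, j]               # absent entries read back as 0

set_sim(42, 7, 0.83)
print(get_sim(7, 42))              # 0.83
```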
You've told us a bit about updates to user profiles, but does your similarity matrix have to be updated as frequently? My expectation (another assumption) is that similarity measures do not change quickly or sharply when a user modifies his/her profile. From this I hypothesise that working with a similarity measure which is a few minutes (or even a few hours) out of date won't do any serious harm.
I think that all this takes us into the domain of databases, which should support fast access to stored similarity measures of the volumes you indicate. I'd be looking to do batch updates of the measures, and only of the measures for users whose profiles have changed, at an interval to suit your demands and availability of computer power.
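A minimal sketch of that kind of store with batched updates (Python with the standard-library sqlite3 module; the table layout and the `recompute_similarities` helper are assumptions):

```python
# Sketch: keep pairwise similarities in a database table and refresh,
# in batch, only the rows involving users whose profiles changed.
import sqlite3

conn = sqlite3.connect("similarity.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS similarity (
        user_a INTEGER NOT NULL,
        user_b INTEGER NOT NULL,      -- convention: user_a < user_b
        score  REAL    NOT NULL,
        PRIMARY KEY (user_a, user_b)
    )
""")

def batch_update(changed_users, recompute_similarities):
    """Recompute and upsert only the pairs touching changed users.

    recompute_similarities(user) is a placeholder that should yield
    (other_user, score) pairs for one user."""
    rows = []
    for u in changed_users:
        for v, score in recompute_similarities(u):
            a, b = (u, v) if u < v else (v, u)
            rows.append((a, b, score))
    conn.executemany(
        "INSERT OR REPLACE INTO similarity (user_a, user_b, score) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
```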
As for the initial creation of the first version of the similarity matrix: so what if it takes a week in the background? You're only going to do it once.
Upvotes: 3
Reputation: 4698
You probably don't need all the pairs, so I would go for a sparse matrix representation. As for the calculation itself, you can use something like a k-d tree or an octree (or anything in that family), or any other space-partitioning method, depending on the properties of your feature set (upon which you calculate the similarity) and of your similarity measure.
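For illustration, a minimal sketch of the space-partitioning route (Python/scipy; the random low-dimensional feature vectors and Euclidean distance are assumptions, and a k-d tree only helps if your similarity corresponds to such a metric):

```python
# Sketch: answer "who are this user's nearest neighbours?" directly with a
# k-d tree instead of materialising all 200,000 x 200,000 pairs.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.random((200_000, 8))       # placeholder low-dimensional user features

tree = cKDTree(X)

# 10 nearest neighbours per user; k=11 because every point matches itself first
dists, idx = tree.query(X, k=11)
neighbours = idx[:, 1:]            # drop the self-match in column 0
```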
Upvotes: 0
Reputation: 5868
The measure is probably symmetric, so you only need to store half the matrix in the database. But this doesn't help much. You could also avoid storing all pairs with measure zero, if you have lots of them.
Store only the data that will actually be displayed, like the top 10 closest users for each user.
And calculate the similarity measure on the fly for all other user pairs.
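A minimal sketch of that top-10 idea, computed in blocks so the full matrix never exists in memory (Python/numpy; the random feature vectors and cosine similarity are assumptions):

```python
# Sketch: compute similarities block by block and keep only the 10 best
# neighbours per user, so the full matrix is never held anywhere.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200_000, 32)).astype(np.float32)   # placeholder features
X /= np.linalg.norm(X, axis=1, keepdims=True)      # unit rows: dot product = cosine

top10 = np.empty((X.shape[0], 10), dtype=np.int32)
block = 1000                                       # ~800 MB of float32 per block; tune to RAM
for start in range(0, X.shape[0], block):
    sims = X[start:start + block] @ X.T            # one slab of rows of the full matrix
    np.fill_diagonal(sims[:, start:start + block], -np.inf)   # ignore self-similarity
    # argpartition picks the 10 largest per row without a full sort
    top10[start:start + block] = np.argpartition(sims, -10, axis=1)[:, -10:]
```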
It still sounds like a nightmare to keep up to date, though; maybe don't store anything at all.
Upvotes: 0