Reputation: 10385
In a project I'm currently working on there are about 200,000 users. For each of these users we have defined a similarity measure with regard to every other user. This yields a 200,000 x 200,000 similarity matrix. A tad large. A naive approach (in Ruby) of calculating each entry would take days.
What strategies can I employ to make computing the matrix entries feasible? And in what data store should I put this beast?
Upvotes: 1
Views: 2792
Reputation: 136359
Storing the matrix, and especially calculating anything based on it, is a nightmare. Your similarity measure most likely uses floats (4 bytes), which means the uncompressed storage size is 200000**2 * 4 bytes = 160 GB.
There are four conceptual solutions to this problem.
Data Compression: Exploit structure in the matrix. It is presumably symmetric, so you only need to store half of it, and if many entries are zero a sparse representation avoids storing them at all.
Data Reduction: You can cluster your users and then build the similarity matrix between clusters. If your clusters contain 200 users each, you only have a 1000x1000 matrix and thus only need 4 MB to store it. This may bring other benefits such as speed and robustness as well (see the sketch after this list).
Vertical Scaling: Use a bigger machine. Amazon has one with 2 TB of memory for as little as 3970 USD ;-)
Horizontal Scaling: Build block matrices, i.e. chunks of the big matrix that are small enough to handle, and process them separately (possibly on several machines).
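To make the data-reduction option concrete, here is a minimal sketch (Python with scikit-learn; the random feature matrix, the choice of cosine similarity, and the cluster count are assumptions, since the question doesn't say what the similarity is based on):

```python
# Sketch: reduce 200,000 users to 1,000 cluster centroids and build the
# similarity matrix between centroids instead of between individual users.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X = rng.random((200_000, 32)).astype(np.float32)   # placeholder user feature vectors

k = 1000                                           # ~200 users per cluster
km = MiniBatchKMeans(n_clusters=k, batch_size=10_000, random_state=0).fit(X)

# 1000 x 1000 cluster-level matrix: ~4 MB in float32 instead of 160 GB
cluster_sim = cosine_similarity(km.cluster_centers_).astype(np.float32)

# Approximate user-to-user similarity via the users' cluster labels
labels = km.labels_
approx_sim_user0_user1 = cluster_sim[labels[0], labels[1]]
```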
Upvotes: 0
Reputation: 78316
Here are some bits and pieces of an answer; there are still too many gaps in what you've told us to permit a good answer, but you can fill those in yourself. From everything you've told us, I don't think that the major part of your task is to efficiently calculate a large similarity matrix; I think the major parts are to efficiently retrieve values from such a matrix and to efficiently update it.
As we've already determined, the matrix is sparse and symmetric; it would be useful to know how sparse. This reduces the storage requirements considerably, but we don't know by how much.
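As an illustration of how much those two properties buy you, here is a minimal sketch (Python/scipy; the dictionary-of-keys format is just one of several sparse representations you could pick):

```python
# Sketch: store only the nonzero entries of the upper triangle; symmetry
# lets us answer queries for either (i, j) order from the same entry.
import numpy as np
from scipy import sparse

n = 200_000
sim = sparse.dok_matrix((n, n), dtype=np.float32)   # cheap while the matrix stays sparse

def set_sim(i, j, value):
    i, j = min(i, j), max(i, j)    # canonical order: keep the upper triangle only
    if value != 0.0:               # skip zero entries entirely
        sim[i, j] = value

def get_sim(i, j):
    i, j = min(i, j), max(i, j)
    return sim[i, j]               # absent entries read back as 0

set_sim(42, 7, 0.83)
print(get_sim(7, 42))              # 0.83
```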
You've told us a bit about updates to user profiles, but does your similarity matrix have to be updated as frequently? My expectation (another assumption) is that similarity measures do not change quickly or sharply when a user modifies his/her profile. From this I hypothesise that working with a similarity measure which is a few minutes (or even a few hours) out of date won't do any serious harm.
I think that all this takes us into the domain of databases, which should support fast access to stored similarity measures of the volumes you indicate. I'd be looking to do batch updates of the measures, and only of the measures for users whose profiles have changed, at an interval to suit your demands and availability of computer power.
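A minimal sketch of that kind of store with batched updates (Python with the standard-library sqlite3 module; the table layout and the `recompute_similarities` helper are assumptions):

```python
# Sketch: keep pairwise similarities in a database table and refresh,
# in batch, only the rows involving users whose profiles changed.
import sqlite3

conn = sqlite3.connect("similarity.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS similarity (
        user_a INTEGER NOT NULL,
        user_b INTEGER NOT NULL,      -- convention: user_a < user_b
        score  REAL    NOT NULL,
        PRIMARY KEY (user_a, user_b)
    )
""")

def batch_update(changed_users, recompute_similarities):
    """Recompute and upsert only the pairs touching changed users.

    recompute_similarities(user) is a placeholder that should yield
    (other_user, score) pairs for one user."""
    rows = []
    for u in changed_users:
        for v, score in recompute_similarities(u):
            a, b = (u, v) if u < v else (v, u)
            rows.append((a, b, score))
    conn.executemany(
        "INSERT OR REPLACE INTO similarity (user_a, user_b, score) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
```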
As for the initial creation of the first version of the similarity matrix: so what if it takes a week in the background? You're only going to do it once.
Upvotes: 3
Reputation: 4698
You probably don't need all the pairs, so I would go for a sparse matrix representation. As for the calculation itself, you can use something like a k-d tree or an octree (or anything in that family), or any other space-partitioning method, depending on the properties of your feature set (upon which you calculate the similarity) and of your similarity measure.
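For illustration, a minimal sketch of the space-partitioning route (Python/scipy; the random low-dimensional feature vectors and Euclidean distance are assumptions, and a k-d tree only helps if your similarity corresponds to such a metric):

```python
# Sketch: answer "who are this user's nearest neighbours?" directly with a
# k-d tree instead of materialising all 200,000 x 200,000 pairs.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.random((200_000, 8))       # placeholder low-dimensional user features

tree = cKDTree(X)

# 10 nearest neighbours per user; k=11 because every point matches itself first
dists, idx = tree.query(X, k=11)
neighbours = idx[:, 1:]            # drop the self-match in column 0
```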
Upvotes: 0
Reputation: 5868
The measure is probably symmetric, so you only need to store half the matrix in the database. But this doesn't help much. You could also avoid storing all pairs with measure zero, if you have lots of them.
Store only the data that will actually be displayed, like the top 10 closest users for each user.
And calculate the similarity measure on the fly for all other user pairs.
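A minimal sketch of that top-10 idea, computed in blocks so the full matrix never exists in memory (Python/numpy; the random feature vectors and cosine similarity are assumptions):

```python
# Sketch: compute similarities block by block and keep only the 10 best
# neighbours per user, so the full matrix is never held anywhere.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200_000, 32)).astype(np.float32)   # placeholder features
X /= np.linalg.norm(X, axis=1, keepdims=True)      # unit rows: dot product = cosine

top10 = np.empty((X.shape[0], 10), dtype=np.int32)
block = 1000                                       # ~800 MB of float32 per block; tune to RAM
for start in range(0, X.shape[0], block):
    sims = X[start:start + block] @ X.T            # one slab of rows of the full matrix
    np.fill_diagonal(sims[:, start:start + block], -np.inf)   # ignore self-similarity
    # argpartition picks the 10 largest per row without a full sort
    top10[start:start + block] = np.argpartition(sims, -10, axis=1)[:, -10:]
```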
It still sounds like a nightmare to keep up to date, though; maybe don't store anything at all.
Upvotes: 0