Reputation: 6016
I have a dataset consisting of the apps used by users and I'm trying to find out the correlation between different apps. Currently, I have a dataset (a 1GB text file) relating to 200,000 users and 800,000 apps. I'm looking for an efficient way to build a matrix that shows the number of users that have installed particular apps together. For example, consider the following matrix:
app1 app2 app3 app4
app1 0 100 300 50
app2 100 0 350 0
app3 300 350 0 70
app4 50 0 70 0
The value 100 indicates the number of users who have app1 AND app2 installed (simultaneously) on their phone. For creating this matrix, I thought of first sorting/filtering (using Map/Reduce)/aggregating (in MongoDB) the main unsorted data file by userID (O(n)?), then for apps used by each user, having 2 nested loops to count the number of apps installed together (i.e. getting the values in the above matrix). But this is so expensive (O(n^3)?) and unscalable.
Of course I've thouh about multithreading and running the program on some cluster, but I need an scalable and efficient algorithm/tool in the first place (e.g. using dynamic programming?). I'm impressed with the performance of GNUPlot in large datasets, and I was hopeful that I can achieve similar performance.
My preference for the programming language is Java, but I've also thought about R (and related HPC packages) and also Bash, although I'm not familiar with R. I also thought about "graph databases", like Neo4J, but I'm not sure if they suit my problem at hand (I need graph visualization of my data in the future though).
Upvotes: 0
Views: 83
Reputation: 14205
Have you thought through what you're trying to do?
There are 800000 apps. An 800000 x 800000 matrix will take almost 5 terabytes of storage. While this is plenty doable with modern harddisks, surely you have something better to do with all that storage.
It will be better to store sorted list of, for each app, which users have the app installed. This way you can use a set intersection algorithm to compute the entries of the matrix implicitly rather quickly.
You'll need to be more specific about what you're going to do with this data downstream in order to get recommendations on how to represent it.
Upvotes: 2