Javad

Reputation: 6016

efficient solution for dealing with huge datasets

I have a dataset consisting of the apps used by users and I'm trying to find out the correlation between different apps. Currently, I have a dataset (a 1GB text file) relating to 200,000 users and 800,000 apps. I'm looking for an efficient way to build a matrix that shows the number of users that have installed particular apps together. For example, consider the following matrix:

      app1  app2  app3  app4
app1  0     100   300   50     
app2  100   0     350   0
app3  300   350   0     70
app4  50    0     70    0

The value 100 indicates the number of users who have both app1 and app2 installed on their phone. To build this matrix, I thought of first sorting/filtering (with Map/Reduce) or aggregating (in MongoDB) the main unsorted data file by userID (O(n)?), then, for the apps used by each user, running two nested loops to count how many apps are installed together (i.e. filling in the values in the matrix above). But this is very expensive (O(n^3)?) and doesn't scale.
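The per-user counting step you describe could look roughly like this (a minimal sketch with made-up names, assuming the data has already been grouped by user; counts are kept in a sparse map rather than a dense matrix):

```java
import java.util.*;

// Hypothetical sketch of the pairwise-counting step: for each user's
// app list, every unordered pair of apps increments one counter.
// The result is a sparse co-installation map instead of a full matrix.
public class CoInstallCounter {

    // Key pairs in a canonical order so (a, b) and (b, a) collapse.
    static String pairKey(String a, String b) {
        return a.compareTo(b) < 0 ? a + "|" + b : b + "|" + a;
    }

    static Map<String, Integer> countPairs(List<List<String>> appsPerUser) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> apps : appsPerUser) {
            // Two nested loops over this user's apps, as described above.
            for (int i = 0; i < apps.size(); i++) {
                for (int j = i + 1; j < apps.size(); j++) {
                    counts.merge(pairKey(apps.get(i), apps.get(j)), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> users = Arrays.asList(
                Arrays.asList("app1", "app2", "app3"),
                Arrays.asList("app1", "app3"));
        System.out.println(countPairs(users).get("app1|app3"));
    }
}
```

Note the cost here is O(k^2) per user, where k is the number of apps that user has installed, not O(n^3) over the whole dataset; it only blows up for users with very many apps.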

Of course I've thought about multithreading and running the program on a cluster, but I need a scalable and efficient algorithm/tool in the first place (e.g. using dynamic programming?). I'm impressed with the performance of GNUPlot on large datasets, and I was hopeful I could achieve similar performance.

My preference for the programming language is Java, but I've also thought about R (and related HPC packages) and also Bash, although I'm not familiar with R. I also thought about "graph databases", like Neo4J, but I'm not sure if they suit my problem at hand (I need graph visualization of my data in the future though).

Upvotes: 0

Views: 83

Answers (1)

tmyklebu

Reputation: 14205

Have you thought through what you're trying to do?

There are 800,000 apps. An 800,000 x 800,000 matrix of counts takes roughly 5 terabytes of storage (800,000² entries at 8 bytes each). While this is plenty doable with modern hard disks, surely you have something better to do with all that storage.

It would be better to store, for each app, a sorted list of the user IDs that have the app installed. That way you can use a set-intersection algorithm to compute any entry of the matrix on demand, rather quickly, without ever materializing the matrix.
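As a sketch of what that looks like (hypothetical names; assumes each app's user-ID list is already sorted), a two-pointer merge counts the common users of two apps in linear time:

```java
// Hypothetical sketch: each app maps to a sorted array of user IDs,
// and a matrix entry is the size of the intersection of two such
// arrays, computed by a two-pointer merge in O(|a| + |b|).
public class SortedIntersection {

    static int intersectionSize(int[] a, int[] b) {
        int i = 0, j = 0, count = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;                      // user only has app a
            } else if (a[i] > b[j]) {
                j++;                      // user only has app b
            } else {
                count++; i++; j++;        // user has both apps
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] usersOfAppA = {3, 7, 12, 40};
        int[] usersOfAppB = {7, 12, 55};
        System.out.println(intersectionSize(usersOfAppA, usersOfAppB));
    }
}
```

Since most app pairs share no users at all, computing entries lazily like this also avoids storing the huge number of zeros a dense matrix would contain.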

You'll need to be more specific about what you're going to do with this data downstream in order to get recommendations on how to represent it.

Upvotes: 2
