Eddie
Eddie

Reputation: 179

Realistic Social Network Model with tens of millions Users. Which technologies should I be using?

I am attempting to model a realistic social network (Facebook). I am a Computer Science Graduate student so I have a grasp on basic data structures and algorithms.

The Idea:
I began this project in java. My idea is to create multiple Areas of Users. Each User in a given area will have a random number of friends with a normal distribution around a given mean. Each User will have a large percentage or cluster of "Friends" from the Area that they belong to. The remainder of their "Friends" will be smaller clusters from a few different random Areas.

Initial Structure
I wanted to create an ArrayList of areas
ArrayList<Area> areas
With each Area holding an ArrayList of Users
ArrayList<User> users
And each User holding an ArrayList of "Friends"
ArrayList<User> friends

From there I can go through each Area, and each User in that Area and give that user most of their friends from that Area, as well as a few friends from a few random Areas. This is easy enough as long as my data set remains small.

The problem:
When I try to create large data sets, I get an OutOfMemoryError due to no more memory in the heap. I now realize that this way of doing it will be impossible if I want to create, say, 30 Area's with 1 millions users per area, and 200 friends per User. I eat up almost 2gb with 1 Area...So now what. My algorithm would work if I could create all the users ahead of time, then simply "give" friends to each user. But I need the Areas and Users created first. There needs to be a User in an Area before it can be made a "friend".

Next Step:
I like my algorithm, it is simple and easy to understand. What I need is a better way to store this data, since it cant be stored and held in memory all at once. I am going to need to not only access the Area a user belongs too, but also a few random areas as well, for each user.

My Questions:
1. What technology/data structure should I be putting this data into. In the end I basically want a User->Friends relationship. The "Area" idea is a way to make this relationship realistic.
2. Should I be using a different language all together. I know that technologies such as Lucene, Hadoop, etc. were created with Java, and are used for large amounts of data...But I have never used them and would like some guidance before I dive into something new.
3. Where should I begin? Obviously I cannot use only java with the data in memory. But I also need to create these Areas of Users before I can give a User a list of Friends.

Sorry for the semi-long read, but I wanted to lay out exactly where I am so you could guide me in the right direction. Thank you to everyone that took the time to read/help me with this topic.

Upvotes: 3

Views: 825

Answers (4)

swisscheese
swisscheese

Reputation: 1775

Well you are not breaking new ground here and there are a lot of existing models that you can pull great amounts of information from and tailor to suit your needs. Especially if you are open to the technologies used. I understand your desire to have it fill this huge number from the start but keep in mind a solid foundation can be built upon and changed as needed without a complete rewrite.

There is some good info and many links to additional good info as to what FB, LinkedIn, Digg, and others are doing here at Stackoverflow question 1009025

Upvotes: 0

Michael Brown
Michael Brown

Reputation: 9153

You need a searchable storage solution to hold your data (rather than holding it all in memory). Either a relational database (such as Oracle, MySQL, or SQL Server) with an O/RM (such as Hibernate) or a nosql database such as mongodb will work just fine.

Upvotes: 2

James Black
James Black

Reputation: 41858

There is probably no benefit keeping it all in memory, unless you are planning on using every node in some visual algorithm to display relationships.

So, if you use a database then you can build your relationships, give random demographic information, if you want to model that also, and then it is a matter of just writing your queries.

But, if you do need a large amount of data then by using 64-bit Java then you can set the memory to a much larger number, depending on what is on your computer.

So, once you built your relationships, then you can begin to write the queries to relate the information in different ways.

You may want to look at using Lists instead of Arrays, when sizes are different, so that you aren't wasting memory when you read the data back. I expect that is the main reason you are running out of memory, if you assume that there are 100 users and the largest number of friends for any of these is 50, but most will have 10, then for the vast majority of users you are wasting space, especially when you are dealing with millions, as the pointer for each object will become non-trivial.

You may want to re-examine your data structures, I expect you have some ineffiencies there.

You may also want to use some monitoring tools, and this page may help: http://www.scribd.com/doc/42817553/Java-Performance-Monitoring

Even something as simple as jconsole would help you to see what is going on with your application.

Upvotes: 0

Puspendu Banerjee
Puspendu Banerjee

Reputation: 2651

  1. Use a database with some ORM tool[JPA with Hibernate etc.] ,
  2. Load data Lazily, when they are really needed
  3. Unload them when them from Cache/Session when they are not really required or inactive.

Feel comfortable to let me know in case there is any difficulty to understand.

http://puspendu.wordpress.com/

Upvotes: 0

Related Questions