Sree Aurovindh

Reputation: 705

Hadoop or PostgreSQL for effective processing

I am a student trying to run some machine learning algorithms over a large data set. We have about 140 million records in our training set (currently in PostgreSQL tables), and there are five tables with about 6 million records that are linked by primary key / foreign key relationships.

We have just two machines, with the following configurations:

1) 6 GB RAM with a 2nd-generation i5 processor
2) 8 GB RAM with a 2nd-generation i7 processor

Since the turnaround time of our statistical analysis is quite high, we are now planning to split the data into logical groupings before running it.

1) Should I split the data into separate tables in PostgreSQL and then use MATLAB or R for the analysis (a sketch of this option follows the list), OR
2) Should I port the database to Hadoop with HBase, OR
3) Should I combine the two, i.e. decompose the data into logical groups kept in a PostgreSQL database and also set up Hadoop + HBase, choosing between them per algorithm?
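To illustrate option 1, here is a minimal sketch of fetching one logical group from PostgreSQL for in-process analysis. The question mentions R and MATLAB, but the same idea is shown in Python here; the table name `training_data`, the column `group_id`, and the connection details are all hypothetical, and the `psycopg2` driver is assumed to be installed.

```python
# Minimal sketch of option 1: fetch one logical group from PostgreSQL
# and analyze it in-process. Table and column names are hypothetical.
import psycopg2

conn = psycopg2.connect(dbname="mldb", user="student",
                        password="secret", host="localhost")
cur = conn.cursor()

# Pull only the records of one logical group, so each run works on a
# slice of the ~140M rows rather than the whole table.
cur.execute("SELECT feature_a, feature_b, label "
            "FROM training_data WHERE group_id = %s", (42,))
rows = cur.fetchall()  # for big groups, prefer cur.fetchmany() in a loop

print("loaded %d records for group 42" % len(rows))

cur.close()
conn.close()
```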

Thanks

Upvotes: 1

Views: 769

Answers (1)

David Gruzman

Reputation: 8088

It is hard to believe that Hadoop will be effective on such a small cluster. If you can effectively parallelize the task without it, that will almost certainly be more effective.
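For example, if the logical groups can be analyzed independently, Python's standard `multiprocessing` module already spreads them across the cores of one machine, with none of Hadoop's overhead. This is only a sketch; `analyze_group` is a hypothetical stand-in for the actual learning step.

```python
# Sketch: per-group analysis parallelized on one machine, no Hadoop.
# analyze_group() is a hypothetical placeholder for the learning step.
from multiprocessing import Pool

def analyze_group(group_id):
    # ... load this group's records and run the algorithm on them ...
    return group_id, "done"

if __name__ == "__main__":
    group_ids = range(100)           # one task per logical group
    with Pool(processes=4) as pool:  # roughly one worker per core
        for gid, status in pool.imap_unordered(analyze_group, group_ids):
            print(gid, status)
```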
Another consideration I would take into account is the iteration time of your learning process. If an iteration takes dozens of seconds, then the Hadoop job overhead (which is about 30 seconds) will be too much: a 20-second iteration plus roughly 30 seconds of overhead means more than half of each run's wall-clock time is spent on job bookkeeping.
What you can get from Hadoop is an effective external parallel sort; that is what the shuffle stage is. If you need that, consider using Hadoop.
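To make that concrete, here is a minimal Hadoop Streaming pair in Python: everything the mapper emits is sorted by key before it reaches the reducer, and that shuffle/sort is exactly the external parallel sort in question. The tab-separated record format and the file names are assumptions.

```python
# mapper.py - emit key<TAB>value; Hadoop's shuffle then sorts all
# mapper output by key across the cluster before the reduce phase.
import sys

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)  # assumed format
    print("%s\t%s" % (key, value))
```

```python
# reducer.py - input arrives already sorted by key, so equal keys are
# adjacent and can be aggregated (here: counted) in a single pass.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, _value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, count))
        current_key, count = key, 0
    count += 1
if current_key is not None:
    print("%s\t%d" % (current_key, count))
```

Both scripts would be submitted with the hadoop-streaming jar (its path varies by distribution), along the lines of `hadoop jar hadoop-streaming.jar -input in -output out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py`.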
Please also note that, in general, it is not easy to port a relational schema to HBase, since joins are not supported; the usual workaround is to denormalize.
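As an illustration of that denormalization, the sketch below performs the join once in PostgreSQL and writes each joined record as a single wide HBase row, so reads never need a join. It assumes the `happybase` Thrift client, an HBase Thrift server on localhost, and hypothetical table, column, and column-family names.

```python
# Sketch: denormalize a PK/FK relationship into wide HBase rows.
# Table, column, and column-family names are hypothetical.
import happybase
import psycopg2

pg = psycopg2.connect(dbname="mldb", user="student", host="localhost")
cur = pg.cursor()
# Do the join once in PostgreSQL, where it is cheap to express...
cur.execute("SELECT t.id, t.feature_a, d.dim_name "
            "FROM training_data t JOIN dim_table d ON d.id = t.dim_id")

hbase = happybase.Connection("localhost")  # HBase Thrift server
table = hbase.table("training")            # column family 'cf' assumed

# ...then store each joined record as one wide row, so reads in HBase
# never need to join.
for row_id, feature_a, dim_name in cur:
    table.put(str(row_id).encode(),
              {b"cf:feature_a": str(feature_a).encode(),
               b"cf:dim_name": dim_name.encode()})
```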

Upvotes: 2
