Paul C

Reputation: 8457

Distributed data aggregation, querying, filtering: any alternative frameworks to Hadoop/MapReduce? (MR is too slow)

We're planning on putting a lot of metric data into some sort of NoSQL database, probably Cassandra, maybe something else, spread across several servers.

We want to run calculations over the data in a MapReduce style: aggregate the data on the machine where it lives, then combine the results.
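To make the pattern concrete, here is a minimal single-JVM sketch of the "aggregate locally, then combine" idea. It is only an illustration under stated assumptions: the class and method names (PartialAggregation, Partial, aggregateLocal, aggregateAll) are hypothetical, each double[] stands in for the metrics held on one server, and a thread pool stands in for the cluster. It does not use any Cassandra or Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PartialAggregation {

    /** Partial result computed where the data lives: enough to derive sum, count and mean. */
    static final class Partial {
        final double sum;
        final long count;

        Partial(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        /** Combine step: merging partials is associative, so arrival order does not matter. */
        Partial merge(Partial other) {
            return new Partial(sum + other.sum, count + other.count);
        }

        double mean() {
            return count == 0 ? 0.0 : sum / count;
        }
    }

    /** Local step: runs against one node's slice of the metrics (hypothetical stand-in). */
    static Partial aggregateLocal(double[] localMetrics) {
        double sum = 0.0;
        for (double v : localMetrics) {
            sum += v;
        }
        return new Partial(sum, localMetrics.length);
    }

    /** Coordinator: submits one task per node, then folds the partial results together. */
    static Partial aggregateAll(List<double[]> nodes) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, nodes.size()));
        try {
            List<Future<Partial>> futures = new ArrayList<>();
            for (double[] node : nodes) {
                Callable<Partial> task = () -> aggregateLocal(node);
                futures.add(pool.submit(task));
            }
            Partial total = new Partial(0.0, 0);
            for (Future<Partial> f : futures) {
                total = total.merge(f.get());
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<double[]> nodes = new ArrayList<>();
        nodes.add(new double[] {1.0, 2.0, 3.0}); // metrics on "server A"
        nodes.add(new double[] {4.0, 5.0});      // metrics on "server B"
        Partial total = aggregateAll(nodes);
        System.out.println("count=" + total.count + " mean=" + total.mean());
    }
}
```

Only the small Partial objects cross the (simulated) machine boundary, never the raw metrics, which is the property we are after.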

I built a proof of concept using Cassandra with Hadoop MapReduce. The overhead of starting the MapReduce jobs and getting the results back was too high for our needs.

Before we roll our own, are there any other distributed Java frameworks out there that emphasize performance?

Upvotes: 3

Views: 746

Answers (4)

Mairbek Khadikov

Reputation: 8089

Take a look at Storm.

From the documentation:

Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, and is a lot of fun to use!

Upvotes: 2

Matt Stephenson

Reputation: 8620

Look at Oracle Coherence, a distributed cache that allows one to partition data among VMs, aggregate and calculate in parallel, and scale horizontally.

Upvotes: 2

Praveen Sripati

Reputation: 33495

"Before we go roll our own, are there any other distributed java frameworks out there that emphasize performance?"

Every framework tries to emphasize performance as one of its dimensions.

"I made a POC using Cassandra and Hadoop and mapreduce. The overhead starting the mapreduce jobs and getting the results was too high for our needs."

Cassandra is one of the supported input sources for MR. Running an MR job involves the time for the map tasks to start and complete, the shuffle, and the time for the reduce tasks to start and complete. MR is designed for batch processing, not for instantaneous results. Some tuning is possible, but you should be looking at a real-time or stream-processing framework.

Take a look at HStreaming (note that I haven't used it):

HStreaming enables to use the same MapReduce and Apache Pig algorithms and functions for real-time or batch processing. Existing code such as user-defined functions (UDF) can be migrated to stream processing with no or minimal changes. It brings your business a rapid development cycle and gives you the agility to adapt fast to changing business requirements.

Upvotes: 1

Paul C

Reputation: 8457

I see that the commercial column-store database Vertica has functionality similar to MapReduce, although you express your aggregations as statements in its version of SQL. I'm sure this product is not cheap, though...

Upvotes: 0
