Reputation: 7641
It made me happy when I heard about parallelStream()
in Java 8, which processes data on multiple cores and gives back the combined result within a single JVM. No more lines of multithreading code. As far as I understand, this is valid for a single JVM only.
But what if I want to distribute the processing across different JVMs on a single host, or even across multiple hosts? Does Java 8 include any abstraction for simplifying that?
In a tutorial at dreamsyssoft.com, a list of users
private static List<User> users = Arrays.asList(
new User(1, "Steve", "Vai", 40),
new User(4, "Joe", "Smith", 32),
new User(3, "Steve", "Johnson", 57),
new User(9, "Mike", "Stevens", 18),
new User(10, "George", "Armstrong", 24),
new User(2, "Jim", "Smith", 40),
new User(8, "Chuck", "Schneider", 34),
new User(5, "Jorje", "Gonzales", 22),
new User(6, "Jane", "Michaels", 47),
new User(7, "Kim", "Berlie", 60)
);
is processed to get their average age like this:
double average = users.parallelStream().mapToInt(u -> u.age).average().getAsDouble();
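For reference, a minimal User class that this snippet assumes could look like the following (field names are my guess from the constructor arguments):

public class User {
    public final int id;
    public final String firstName;
    public final String lastName;
    public final int age;

    public User(int id, String firstName, String lastName, int age) {
        this.id = id;
        this.firstName = firstName;
        this.lastName = lastName;
        this.age = age;
    }
}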
In this case it is processed on a single host.
My question is: Can it be processed utilizing multiple hosts?
E.g. Host1 processes the list below and returns average1 for the first five users:
new User(1, "Steve", "Vai", 40),
new User(4, "Joe", "Smith", 32),
new User(3, "Steve", "Johnson", 57),
new User(9, "Mike", "Stevens", 18),
new User(10, "George", "Armstrong", 24)
Similarly, Host2 processes the list below and returns average2 for the remaining five users:
new User(2, "Jim", "Smith", 40),
new User(8, "Chuck", "Schneider", 34),
new User(5, "Jorje", "Gonzales", 22),
new User(6, "Jane", "Michaels", 47),
new User(7, "Kim", "Berlie", 60)
Finally, Host3 computes the final result like:
average = (average1 + average2) / 2
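To make the combining step concrete, here is a minimal single-JVM sketch of the same split-and-combine logic (the half1/half2 names are hypothetical):

List<User> half1 = users.subList(0, 5);
List<User> half2 = users.subList(5, 10);

// Each "host" computes a partial average (here simulated in one JVM).
double average1 = half1.parallelStream().mapToInt(u -> u.age).average().getAsDouble();
double average2 = half2.parallelStream().mapToInt(u -> u.age).average().getAsDouble();

// The simple mean of means is only correct because both halves have the
// same size; for unequal splits the partial averages must be weighted:
double average = (average1 * half1.size() + average2 * half2.size())
                 / (half1.size() + half2.size());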
Using a distributed architecture, this can be solved with remoting. Does Java 8 have some simpler way to solve the problem, with an abstraction for it?
I know frameworks like Hadoop, Akka and Promises solve it, but I am asking about pure Java 8. Is there any documentation, and are there examples, for parallelStream()
across multiple hosts?
Upvotes: 13
Views: 5890
Reputation: 8032
I'm not sure what will happen with Java 8, since it is too early to tell, but there are a couple of open source projects that extend the map-reduce capabilities of earlier JVM functional programming languages to distributed computing environments.
Recently, I took a traditional yet non-trivial Hadoop map-reduce job (one that takes raw performance data and prepares it for loading into an OLAP cube) and rewrote it in both Clojure using Cascalog and Scala using Spark. I documented my findings in a blog post called Distributed Computing and Functional Programming.
These open source projects are mature and ready for prime time. They are supported by both Cloudera and Hortonworks.
Upvotes: 0
Reputation: 8449
Don't expect such a feature in the core language, as it requires some kind of server to run and manage the different processes. Historically, I don't know of similar solutions that were part of core Java.
There are, however, some solutions that are similar to what you want. One of them is Cascading (http://www.cascading.org/), a functional-style infrastructure for writing map-reduce programs. This means the actual code is relatively lightweight (unlike traditional map-reduce programs), but it does require maintaining a Hadoop infrastructure.
Upvotes: 0
Reputation: 30300
Here is the list of features scheduled for Java 8 as of September 2013.
As you can see, there is no feature dedicated to standardizing distributed computing over a cluster. The closest you have is JEP 107, which builds on the Fork/Join framework in JDK 7 to leverage multi-core CPUs. In Java 8, you will be able to use lambda expressions to perform bulk operations on collections in parallel by dividing the task among multiple processors.
Java 8 is also scheduled to feature JEP 103, which likewise builds on Java 7 Fork/Join to sort arrays in parallel. Meanwhile, since Fork/Join is clearly a big deal, it evolves further with JEP 155.
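For a flavor of these two features, here is a small sketch using the proposed Java 8 APIs (Arrays.parallelSort from JEP 103, and a lambda-based bulk operation over a collection):

import java.util.Arrays;
import java.util.List;

public class ParallelDemo {
    public static void main(String[] args) {
        // JEP 103: sort a large array in parallel on top of Fork/Join
        int[] data = new java.util.Random().ints(10_000_000).toArray();
        Arrays.parallelSort(data);

        // JEP 107-style bulk operation: a lambda applied in parallel,
        // with the work divided among the available cores
        List<Integer> ages = Arrays.asList(40, 32, 57, 18, 24);
        double average = ages.parallelStream()
                             .mapToInt(Integer::intValue)
                             .average()
                             .getAsDouble();
        System.out.println(average);
    }
}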
So there are no core Java 8 abstractions for distributed computing over a cluster--only over multiple cores. You will need to devise your own solution for real distributed computing using existing facilities.
As disappointing as that may be, I would point out that there are still wonderful open-source third party abstractions over Hadoop out there like Cascalog and Apache Spark. Spark in particular lets you perform operations on your data in a distributed way through the RDD abstraction, which makes it feel like your data is just in a fancy array.
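For instance, the average-age computation from the question could be written against Spark's Java API roughly like this (a sketch; it assumes User implements Serializable and reuses the users list from the question, and runs locally via the local[*] master):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("AverageAge").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);

// parallelize() partitions the list; on a real cluster the partitions
// are processed on different machines.
JavaRDD<User> rdd = sc.parallelize(users);
double average = rdd.mapToDouble(u -> u.age).mean();

sc.stop();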
But you will have to wait for such things in core Java.
Upvotes: 10
Reputation: 1824
There is nothing in the documentation/specs that shows there will be such a feature. But consider for a moment that RMI is the Java solution for distribution, and it is pretty straightforward: you could use it as the base for distribution and, on each node, use the core parallelism as you showed.
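To illustrate, here is a minimal RMI sketch along those lines (all names are hypothetical; User must implement Serializable, and each worker host uses a parallel stream internally):

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.server.UnicastRemoteObject;
import java.util.List;

// The remote interface every worker host exposes.
interface AverageService extends Remote {
    double averageAge(List<User> users) throws RemoteException;
}

// Worker side: core parallelism within a single JVM, as in the question.
class AverageServiceImpl implements AverageService {
    @Override
    public double averageAge(List<User> users) {
        return users.parallelStream().mapToInt(u -> u.age).average().getAsDouble();
    }

    public static void main(String[] args) throws Exception {
        AverageService stub =
                (AverageService) UnicastRemoteObject.exportObject(new AverageServiceImpl(), 0);
        LocateRegistry.createRegistry(1099).rebind("average", stub);
    }
}

// Coordinator side: split the list, delegate to the hosts, combine.
class Coordinator {
    public static void main(String[] args) throws Exception {
        List<User> users = SampleData.users(); // hypothetical helper: the ten users above

        AverageService host1 =
                (AverageService) LocateRegistry.getRegistry("host1").lookup("average");
        AverageService host2 =
                (AverageService) LocateRegistry.getRegistry("host2").lookup("average");

        double average1 = host1.averageAge(users.subList(0, 5));
        double average2 = host2.averageAge(users.subList(5, 10));
        double average = (average1 + average2) / 2; // both halves hold five users
        System.out.println(average);
    }
}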
Upvotes: 0