Tags: hadoop, mapreduce, cluster-computing, distributed-computing, hpc

Reputation: 9864

What are the most common uses for distributed computing?

I wrote a very simple distributed computing platform (based on the Map/Reduce paradigm), and I'm in the process of writing some demos and showcases. I have a very small team and have to prioritize which demos I'll write first.

To prioritize, I need to rank the demos by a weighted score: about 70% for being a relevant, common, significant use case of distributed computing, and 30% for being easy to write.

So far I have it ordered like this:

1. Discovering pi digits with Monte Carlo
2. Numerical integration with Monte Carlo
3. Large matrix multiplication (dense matrices)
4. Linear regressions
5. Large matrix inversion
6. Multiple regressions
7. Sorting
8. Clustering (K-Means)
9. Clustering (Hierarchical)

Number 1 is on the list because it took 10 minutes to write, although it's completely useless (I'm not sure, but I figure there aren't many people trying to find more digits of pi).

Due to the nature of my platform, it will shine most on problems that are embarrassingly parallel, and not I/O-bound or reduce-dominated.

How would you change my list? What would you add to it? Is sorting useful at all in the enterprise world or is it only for benchmarking distributed computing platforms?

Upvotes: 1

Answers (2)

Hristo Iliev

Reputation: 74485

I second Mark in that you are mixing distributed computing and HPC. Here are some comments on each of your topics:

(1) There are people trying to compute as many digits of pi as they can, but the Monte Carlo algorithm is completely useless there: its error shrinks only with the inverse square root of the number of trials, so to get one more decimal digit of precision you would need roughly 100 times more trials. There are other algorithms - see if you can implement some of them using Map/Reduce.

(2) This one is fine, although seldom used - same problem with precision as (1).
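To make the precision point in (1) and (2) concrete, here is a minimal Python sketch of the Monte Carlo pi estimate arranged as a map step and a reduce step. The names `map_task`, `reduce_counts` and `estimate_pi` are made up for illustration and are not the API of any particular platform; the "distribution" is just a list comprehension standing in for independent workers.

```python
import random
from functools import reduce


def map_task(seed, n_trials):
    """Map step: count random points in the unit square that fall inside the quarter circle."""
    rng = random.Random(seed)  # one independent generator per task (assumed seeding scheme)
    hits = 0
    for _ in range(n_trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits, n_trials


def reduce_counts(a, b):
    """Reduce step: add two partial (hits, trials) pairs."""
    return a[0] + b[0], a[1] + b[1]


def estimate_pi(n_tasks, trials_per_task):
    # Each map_task call is independent, so a real platform could run them on separate nodes.
    partials = [map_task(seed, trials_per_task) for seed in range(n_tasks)]
    hits, trials = reduce(reduce_counts, partials)
    return 4.0 * hits / trials


if __name__ == "__main__":
    for tasks in (1, 100):
        print(tasks * 10_000, "trials:", estimate_pi(tasks, 10_000))
```

Because the statistical error falls off as one over the square root of the trial count, going from 10,000 to 1,000,000 trials should buy only about one extra correct digit, which is why this works as a demo but not as a way to compute pi.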

(5) Pure matrix inversions are seldom performed, mainly because of numerical instabilities. How about solving a dense system of linear equations instead?
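As a rough illustration of "solve the system instead of inverting", the sketch below solves a dense system by Gaussian elimination with partial pivoting in plain Python. It is a serial toy, not a distributed implementation, and `solve_dense` is a hypothetical helper name; the observation that the row updates for a given pivot column are independent of one another is what a parallel version might exploit.

```python
def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (no explicit inverse)."""
    n = len(a)
    # Work on an augmented copy [a | b] so the caller's data is left untouched.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]

    for col in range(n):
        # Partial pivoting: use the row with the largest absolute entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]

        # Eliminate the entries below the pivot; each row update is independent of the others.
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]

    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x
```

For example, `solve_dense([[2, 1], [1, 3]], [3, 5])` returns approximately `[0.8, 1.4]`.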

I would say that you are missing one of the main uses of M/R processing nowadays, namely graph processing (read: analysis of social and other networks and flows). A more general optimisation problem, e.g. genetic algorithms, might also be nice.
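One possible graph-processing demo is a PageRank-style iteration, which maps naturally onto map and reduce steps. The sketch below is only an illustration under simplifying assumptions: the graph is a small in-memory dict, every node has at least one outgoing link, and the function names and the damping value of 0.85 are choices made here, not anything taken from the question.

```python
from collections import defaultdict


def map_node(rank, out_links):
    """Map step: a node sends an equal share of its rank to each node it links to."""
    share = rank / len(out_links)
    return [(target, share) for target in out_links]


def reduce_node(contributions, n_nodes, damping):
    """Reduce step: combine the shares received by one node into its new rank."""
    return (1.0 - damping) / n_nodes + damping * sum(contributions)


def pagerank_step(graph, ranks, damping=0.85):
    """One iteration; assumes every node in `graph` has at least one outgoing link."""
    grouped = defaultdict(list)  # stands in for the shuffle phase: group shares by target
    for node, out_links in graph.items():
        for target, share in map_node(ranks[node], out_links):
            grouped[target].append(share)
    n = len(graph)
    return {node: reduce_node(grouped[node], n, damping) for node in graph}


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    ranks = {node: 1.0 / len(graph) for node in graph}
    for _ in range(20):
        ranks = pagerank_step(graph, ranks)
    print(ranks)
```

Each iteration is one full map/shuffle/reduce round, so a demo like this also shows how the platform copes with iterative jobs rather than a single pass.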

Upvotes: 2

High Performance Mark

Reputation: 78364

Your list suggests that you are not distinguishing between parallel computing and distributed computing. This is not necessarily wrong but someone looking for a demonstration of the excellence of a distributed computing platform might be left tepidly enthused upon seeing parallel computations, such as your items 2 - 5, being performed.

Sorting is certainly useful everywhere there is data: large enterprises, small enterprises, in your desk drawers, across the Googlesphere. So too is searching, which is a surprising omission from your list. The other omission which strikes me immediately is any sort of data fusion, merging large datasets to get information from their intersections beyond what can be extracted from the datasets individually.
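A small sketch of what a data fusion demo could look like is a reduce-side join: the map step tags each record with its source and emits it under the join key, and the reduce step pairs up records from the two sources that share a key. The function names, the `"left"`/`"right"` tags and the sample field names are all illustrative assumptions.

```python
from collections import defaultdict


def map_records(source, records, key_field):
    """Map step: emit (join key, (source tag, record)) for every record."""
    return [(rec[key_field], (source, rec)) for rec in records]


def reduce_join(tagged):
    """Reduce step: combine every left record with every right record for one key."""
    left = [rec for source, rec in tagged if source == "left"]
    right = [rec for source, rec in tagged if source == "right"]
    return [{**l, **r} for l in left for r in right]


def join(left_records, right_records, key_field):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    emitted = (map_records("left", left_records, key_field)
               + map_records("right", right_records, key_field))
    for key, tagged in emitted:
        grouped[key].append(tagged)
    joined = []
    for key in sorted(grouped):  # sorted only to make the output order deterministic
        joined.extend(reduce_join(grouped[key]))
    return joined


if __name__ == "__main__":
    customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
    orders = [{"id": 1, "total": 30}, {"id": 3, "total": 5}]
    print(join(customers, orders, "id"))
```

Keys that appear in only one dataset produce no output here, so this behaves like an inner join; note that joins tend to be shuffle-heavy, which may matter for a platform that is weaker on reduce-dominated work.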

Upvotes: 4
