MapReduce for same task/different data

Question

We have a system that is made up of multiple PostgreSQL databases. Each database has the same tables, i.e., the same schema, but only carries a share of the data (and not the full data!). The reason for distributing the data is that our customers run queries that are rather complex and perform up to 100 calculations per row.

By distributing the data to multiple databases, we want to lower the amount of work processed by each database, and ultimately speed up search. At the end, we combine the results of each database to create the final results.

A friend of mine has recommended looking at MapReduce (Hadoop). In my opinion, MapReduce only makes sense if the individual workers share the same data but perform different types of work on it (corresponding to multiple instruction, single data).
In our case, however, the workers should all perform the same task, but each on different data (corresponding to single instruction, multiple data).

Does MapReduce (Hadoop) make sense for the paradigm same task executed on different data?

Donald Miner · Accepted Answer

Does MapReduce (Hadoop) make sense for the paradigm same task executed on different data?

Yes.

I think you have a misconception about Hadoop and MapReduce. A MapReduce job does indeed work on the same type of data (i.e., "same tables"), but different segments of that data. The parallel Map and Reduce tasks are the same tasks over different portions of the data. MapReduce is most definitely "single instruction, multiple data" from your definition.
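To make the "same task, different portions of the data" point concrete, here is a minimal single-process sketch of the MapReduce pattern in plain Python. It does not use Hadoop at all; the shard contents, the doubling "calculation", and the names `map_task`, `shuffle`, `reduce_task`, and `run_job` are all hypothetical and only illustrate the shape of a job.

```python
from collections import defaultdict

# Hypothetical shards: same schema (customer, amount), different rows in each.
shards = [
    [("alice", 10), ("bob", 5)],
    [("alice", 7), ("carol", 3)],
    [("bob", 1), ("carol", 4)],
]

def map_task(rows):
    """The SAME function runs on every shard and emits (key, value) pairs."""
    # Stand-in for the expensive per-row calculation: double the amount.
    return [(customer, amount * 2) for customer, amount in rows]

def shuffle(mapped_outputs):
    """Group all emitted values by key, across all map outputs."""
    groups = defaultdict(list)
    for output in mapped_outputs:
        for key, value in output:
            groups[key].append(value)
    return groups

def reduce_task(key, values):
    """The SAME function runs for every key and combines its values."""
    return key, sum(values)

def run_job(all_shards):
    # In a real cluster the map tasks would run in parallel on separate nodes.
    mapped = [map_task(rows) for rows in all_shards]
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in grouped.items())

result = run_job(shards)
print(result)
```

Every shard goes through the identical `map_task`, and the combine step you currently do by hand corresponds roughly to the shuffle and reduce phases.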

Hadoop is by no means a drop-in replacement for a SQL database. They do different things in different ways. Here are some other things to note:

MapReduce is really only going to do batch analytics for you: things like rollups, counts, and aggregates. You won't be able to retrieve or search individual records effectively with MapReduce. Also, updating data in place is not something you typically do in Hadoop; you treat the data as more or less "append only". For any of that, you'll probably want to look at HBase.

Hadoop's file system segments the data for you. From a file system perspective, it'll look like files in folders that contain CSV (or some other file format). Files get split up into blocks, which can then be operated on separately with map tasks. You won't have to manually shard the data like you are now.
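As a rough illustration of that idea, the sketch below splits a small CSV into fixed-size chunks and runs the same map function on each chunk independently. It is a simplification under stated assumptions: real HDFS blocks are split by byte size rather than by record count, and the data, `split_records`, and `map_block` are hypothetical names made up for this example.

```python
import csv
import io

# Hypothetical file contents; in Hadoop this would be a file stored in HDFS.
CSV_TEXT = "id,value\n1,10\n2,20\n3,30\n4,40\n5,50\n"

def split_records(records, records_per_block):
    """Cut the record list into consecutive chunks (stand-in for blocks)."""
    return [
        records[i:i + records_per_block]
        for i in range(0, len(records), records_per_block)
    ]

def map_block(block):
    """The same map logic applied to one block: a partial sum of 'value'."""
    return sum(int(row["value"]) for row in block)

records = list(csv.DictReader(io.StringIO(CSV_TEXT)))
blocks = split_records(records, records_per_block=2)

# Each block could be processed by a separate map task on a separate node.
partials = [map_block(block) for block in blocks]

# A single reduce step combines the partial results.
total = sum(partials)
print(partials, total)
```

The point is only that the splitting is done by the framework, so nobody has to decide which rows belong to which database.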

Take a look at Hive. It's an abstraction layer on top of MapReduce that translates a light version of SQL into MapReduce jobs under the covers. It should allow you to convert some of your logic a bit more easily.
