Reputation: 429
Situation: I have a Postgres DB that contains a table with several million rows and I'm trying to query all of those rows for a MapReduce job.
From the research I've done on DBInputFormat, Hadoop might try to run the same query again for a new mapper, and since these queries take a considerable amount of time I'd like to prevent that in one of two ways I've thought up:
1) Limit the job to run only one mapper that queries the whole table, and call it good.
or
2) Somehow incorporate an offset in the query so that if Hadoop does try to use a new mapper, it won't grab the same rows.
I feel like option (1) is more promising, but I don't know if such a configuration is possible. Option (2) sounds nice in theory, but I have no idea how I would keep track of the mappers being created, or whether it's even possible to detect that and reconfigure on the fly.
Help is appreciated. I'm mainly looking for a way to pull all of the table's data without several copies of the same query running, because that would be a waste of time.
Upvotes: 1
Views: 447
Reputation: 18424
DBInputFormat essentially already does your option 2: it uses LIMIT and OFFSET in its queries to divide the work among the map tasks.
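For example, splitting a 1,000,000-row table across two mappers would produce queries roughly of the following form (table and key names here are placeholders, not the exact SQL DBInputFormat emits):

```sql
-- Mapper 1 reads the first half of the table
SELECT * FROM mytable ORDER BY mykey LIMIT 500000 OFFSET 0;
-- Mapper 2 reads the second half
SELECT * FROM mytable ORDER BY mykey LIMIT 500000 OFFSET 500000;
```

Each mapper gets a disjoint slice, so the queries overlap in work only to the extent that the database must skip OFFSET rows to find each slice's start.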
So if you have proper indexes on the key field, you probably shouldn't mind that multiple queries are being run. Where you do get some possible re-work is with speculative execution: sometimes Hadoop schedules multiple copies of the same task and simply uses the output from whichever finishes first. If you wish, you can turn this off by setting the following property:
mapred.map.tasks.speculative.execution=false
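If you'd rather set this in your job driver than on the command line, a minimal sketch with the old `mapred` API looks like this (the driver class name is hypothetical):

```java
// Driver-side configuration sketch (old org.apache.hadoop.mapred API).
import org.apache.hadoop.mapred.JobConf;

public class NoSpeculationDriver {
    public static JobConf configure() {
        JobConf conf = new JobConf(NoSpeculationDriver.class);
        // Disable speculative execution for map tasks so Hadoop never
        // launches a duplicate copy of the same expensive DB query.
        conf.setMapSpeculativeExecution(false);
        // Equivalent to conf.set("mapred.map.tasks.speculative.execution", "false")
        return conf;
    }
}
```

This needs the Hadoop jars on the classpath; it's a configuration fragment, not a complete job.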
However, all of this is out the window if you don't have a sensible key on which you can efficiently run these ORDER BY/LIMIT/OFFSET queries. That's where you might consider your option number 1. That configuration is definitely possible. Set the property:
mapred.map.tasks=1
Technically, the InputFormat gets the "final say" over how many map tasks are run, but DBInputFormat always respects this property.
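In driver code, the same thing can be expressed with the old `mapred` API's setter (again, the class name is a placeholder):

```java
// Driver-side configuration sketch (old org.apache.hadoop.mapred API).
import org.apache.hadoop.mapred.JobConf;

public class SingleMapperDriver {
    public static JobConf configure() {
        JobConf conf = new JobConf(SingleMapperDriver.class);
        // Request exactly one map task, so the whole table is read by a
        // single query. Same effect as setting mapred.map.tasks=1.
        conf.setNumMapTasks(1);
        return conf;
    }
}
```

You give up all parallelism in the map phase this way, but for a slow full-table query that may be exactly what you want.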
Another option you can consider is a utility called Sqoop, which is built for transferring data between relational databases and Hadoop. This makes it a two-step process, however: first copy the data from Postgres to HDFS, then run your MapReduce job against the copy.
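The two steps might look roughly like this on the command line (the connection string, credentials, table name, jar, and paths are all placeholders you'd replace with your own):

```shell
# Step 1: copy the table from Postgres into HDFS
sqoop import \
  --connect jdbc:postgresql://dbhost/mydb \
  --username myuser -P \
  --table mytable \
  --target-dir /data/mytable

# Step 2: run the MapReduce job against the imported files
hadoop jar myjob.jar com.example.MyJob /data/mytable /data/output
```

Sqoop parallelizes the import itself (splitting on a key column by default), so the copy step benefits from the same kind of split logic without your job ever re-querying the database.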
Upvotes: 3