Philippe Girolami

Reputation: 1876

How to parallelize a query in Apache Hive for a (small) dataset

I am testing the latest Hive on parts of my data set. It's only a couple of GB of log files that I am reading through a custom SerDe.

When I run simple Group By queries (4 MR jobs), I am getting logs such as

all the while only using one core on the 8 core server. Kind of a waste...

I have activated the parallel option but it still won't parallelize. I have set the number of reduce tasks to 8.

My expectation is that since my data set is partitioned (=> different files), at least some of the map-reduce phases could run in parallel on those files.

Is my understanding wrong? Is there a specific way to write the queries?

Thanks

Upvotes: 1

Views: 648

Answers (1)

ohshazbot

Reputation: 904

If you're doing nothing but a simple GROUP BY, the only real processing is comparison, which isn't that expensive. That said, how many mappers are you running? A single task does not run in parallel internally. Rather, Hadoop counts on multiple tasks running at the same time (across tasktrackers and their task slots) to parallelize. So if you're only running one map task per node, you won't see more than one core in use.

Another possibility is that because you're doing a GROUP BY, you're bound by I/O rather than by CPU, so there's no need to bring multiple cores into it.
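
To check the settings mentioned in the question, something like the following can be run from the Hive CLI. This is only a sketch: it assumes a classic (MRv1-style) setup where these property names apply (newer Hadoop versions rename some of them), and `logs` and `host` are hypothetical table and column names.

```sql
-- Print the current values (SET with a key and no value only displays it)
SET hive.exec.parallel;
SET mapred.reduce.tasks;

-- Allow independent stages of a single query to run concurrently
SET hive.exec.parallel=true;
-- Request 8 reduce tasks per job
SET mapred.reduce.tasks=8;

-- Show the stage plan, including which stages depend on each other
-- (logs and host are placeholder names, not from the original question)
EXPLAIN SELECT host, COUNT(*) FROM logs GROUP BY host;
```

Note that `hive.exec.parallel` only helps when the plan contains stages that do not depend on each other; stages that feed one another still run one after the other. The number of map tasks a node runs at once is not a per-query setting. On MRv1 it is, as far as I know, controlled by `mapred.tasktracker.map.tasks.maximum` in the tasktracker's configuration.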

Upvotes: 2
