Reputation: 550
This is on a 3-node cluster composed of c3.2xlarge instances.
I have two tables. Table U has about 65 million records and contains, among other fields, latitude and longitude. Table L has about 1 million records and also contains latitude and longitude.
U is stored as an ORC table.
The task is to determine how many U records fall within a 10-mile radius of each location in L.
select l.id, count(u.id)
from U u, L l
where 3960 * acos(cos(radians(l.lat)) * cos(radians(u.lat)) *
          cos(radians(l.long) - radians(u.long)) +
          sin(radians(l.lat)) * sin(radians(u.lat))) < 10.0
group by l.id;
The expression

    3960 * acos(cos(radians(l.lat)) * cos(radians(u.lat)) *
        cos(radians(l.long) - radians(u.long)) +
        sin(radians(l.lat)) * sin(radians(u.lat))) < 10.0

is just the constraint that the great-circle distance between the two lat/long pairs must be less than 10 miles.
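For reference, this distance expression is the spherical law of cosines. A minimal Python sketch of the same formula (using the query's Earth radius of 3960 miles; the function name and test coordinates are illustrative, not from the question) can be used to sanity-check it outside Hive:

```python
import math

EARTH_RADIUS_MILES = 3960.0  # same constant the Hive query uses

def great_circle_miles(lat1, lon1, lat2, lon2):
    """Spherical law of cosines, mirroring the expression in the WHERE clause."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon1 - lon2)
    # Clamp to [-1, 1] to guard against floating-point drift outside acos's domain.
    arg = max(-1.0, min(1.0, math.cos(p1) * math.cos(p2) * math.cos(dlon)
                        + math.sin(p1) * math.sin(p2)))
    return EARTH_RADIUS_MILES * math.acos(arg)

# One degree of latitude is roughly 69 miles on a 3960-mile sphere.
print(round(great_circle_miles(40.0, -74.0, 41.0, -74.0), 1))  # → 69.1
```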
Problem: The query seems to take forever to complete. While the map phase completes relatively quickly, the reduce phase gets stuck at a fixed percentage (around 80%).
I noticed this in the output messages that Hive emits: Number of reduce tasks determined at compile time: 1
I tried increasing the number of reducers by setting mapred.reduce.tasks to 7, but it always ends up as 1.
This answer seems to suggest that writing the query differently might force more than 1 reducer, but I haven't managed to do that yet.
Execution time estimate: for a single location in L it takes about 60 seconds to get an answer. At that rate, the 1 million locations should take about 60 million seconds, which is around 700 days! Should it take that much time, even for Hadoop?
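The back-of-the-envelope arithmetic behind the ~700-day figure:

```python
# 1,000,000 locations in L at ~60 seconds each, converted to days
total_seconds = 1_000_000 * 60
print(total_seconds / 86_400)  # 86,400 seconds per day → ~694 days
```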
I have also tried adding additional constraints, like bounding the lat/long to a 10-mile by 10-mile square box with the L location at the centre, but the time taken is now 40 seconds per location, which is not a vast improvement.
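For the bounding-box prefilter, the box edges can be precomputed per L row so that the expensive acos() distance only runs on candidate points. A rough Python sketch of that calculation (the ~69.1 miles per degree of latitude follows from the 3960-mile radius; the helper name is illustrative, not from the question):

```python
import math

MILES_PER_DEG_LAT = 3960.0 * math.pi / 180.0  # ~69.1 miles per degree of latitude

def bounding_box(lat, lon, radius_miles):
    """Return (lat_min, lat_max, lon_min, lon_max) of a square box around (lat, lon).

    Any point outside this box is certainly farther than radius_miles away,
    so the exact acos() distance only needs to be evaluated for points inside it.
    """
    dlat = radius_miles / MILES_PER_DEG_LAT
    # A degree of longitude shrinks by cos(latitude) away from the equator.
    dlon = radius_miles / (MILES_PER_DEG_LAT * math.cos(math.radians(lat)))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon

print(bounding_box(40.0, -74.0, 10.0))
```

In Hive this would translate into simple range predicates on u.lat and u.long ahead of the acos() test, which the optimizer can evaluate far more cheaply than trigonometry per pair.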
Questions:
1) How can I force a larger number of reducers?
2) Is there a better query (in terms of execution time)?
3) Any other advice that will help me solve this problem?
Version: Hadoop - 2.7.0 Java 1.7.0_80 Hive 1.2.1
Upvotes: 5
Views: 1755
Reputation: 38950
As noted earlier, mapred.reduce.tasks has been replaced with mapreduce.job.reduces.
Have a look at all the deprecated properties and their replacement values in this article.
Apart from deprecated properties, you may have to migrate some of the functions as well, as explained in this article.
Upvotes: 1
Reputation: 4251
If you're using Hadoop 2, the right way to specify reducers is:
SET mapreduce.job.reduces=7;
This is related to changes made in Hadoop 2. I had the same behavior you're describing, switched to the command above, and was then able to set the number of reduce tasks. I'm not sure how Hadoop chooses the number of reducers, but it does seem to pick too few in some situations, and that can cause significant performance lags.
Upvotes: 0