Joins on two large tables using UDF in Hive - performance is too slow

Question

I have two tables in hive. One has around 2 millions of records and other has 14 miliions of records. I am joining these two tables. Also I am applying UDF in WHERE clause. It is taking too much time to perform JOIN operation.

I have tried to run the query for many times but it run for around 2 hrs and still my reducer remains at 70% and after that I am getting exception "java.io.IOException: No space left on device" and job gets killed.

I have tried to set the parameters as below:

set mapreduce.task.io.sort.mb=256;
set mapreduce.task.io.sort.factor=100;
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapred.child.java.opts=-Xmx1024m;

My Query -

insert overwrite table output select col1, col2, name1, name2, col3, col4, 
t.zip, t.state from table1 m join table2 t ON (t.state=m.state and t.zip=m.zip) 
where matchStrings(concat(name1,'|',name2))>=0.9;

The above query takes 8 mappers and 2 reducers.

Can someone please suggest what do I suppose to do to improve performance.

Joins on two large tables using UDF in Hive - performance is too slow

Answers (1)

Related Questions