Vijiy

Reputation: 1197

Hive Distinct Query takes time when we have more files

Table Structure -

hive> desc table1;
OK
col1 string
col2 string
col3 string
col4 bigint
col5 string
Time taken: 0.454 seconds, Fetched: 5 row(s);

Number of underlying files -

[user@localhost ~]$ hadoop fs -ls /user/hive/warehouse/database.db/table | wc -l
58822
[user@localhost ~]$

Distinct Query - select distinct concat(col1,'~',col2,'~',col3) from vn_req_tab;

Total records - ~2M. The above query runs for 8 hours.

What is causing the issue, and how do I debug this query?

Upvotes: 2

Views: 687

Answers (1)

Strick

Reputation: 1642

You have a very large number of small files, and this is the main problem. When you execute the query, one mapper runs per file, so a large number of mappers are launched, each processing a small piece of data (one file each). They consume cluster resources unnecessarily and wait for one another to finish.

Please note that Hadoop is best suited to large files holding a lot of data.

If you had run the same query on bigger files, it would have performed much better.
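One way to get bigger files is to rewrite the table once so that Hive merges its small output files. The following is only a rough sketch: the target table name `vn_req_tab_compacted` is hypothetical, the size values are illustrative, and whether a one-off rewrite is acceptable depends on how the table is loaded.

```sql
-- Merge small output files at the end of map-only jobs.
set hive.merge.mapfiles=true;
-- Merge small output files at the end of map-reduce jobs.
set hive.merge.mapredfiles=true;
-- Target size in bytes of each merged file (illustrative value).
set hive.merge.size.per.task=256000000;
-- Trigger a merge when the average output file is smaller than this (illustrative value).
set hive.merge.smallfiles.avgsize=128000000;

-- vn_req_tab_compacted is a hypothetical name for the rewritten copy.
create table vn_req_tab_compacted as
select * from vn_req_tab;
```

The rewrite itself still has to read all the small files, so it will likely benefit from the split settings suggested in this answer as well.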

Try setting the below properties

-- You can also set mapred.max.split.size for optimal performance.
set mapred.min.split.size=100000000;

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

Try tweaking these values to arrive at an optimal number of mappers.
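As a starting point for that tuning, a session might look roughly like the sketch below. The byte values are assumptions chosen for illustration, not recommendations; the per-node and per-rack properties are additional Hadoop split settings that the combining input format consults, and they are not part of the original suggestion.

```sql
-- Combine many small files into fewer, larger splits.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Upper bound on the size of a combined split, in bytes (illustrative: ~256 MB).
set mapred.max.split.size=256000000;
-- Lower bound on the split size, in bytes (value from the suggestion above).
set mapred.min.split.size=100000000;
-- Minimum bytes to group into a split from one node, then from one rack (illustrative).
set mapred.min.split.size.per.node=100000000;
set mapred.min.split.size.per.rack=100000000;

select distinct concat(col1,'~',col2,'~',col3) from vn_req_tab;
```

Raising the maximum split size should reduce the number of mappers; lowering it should increase them. Compare the mapper count reported at job launch between runs to see the effect.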

Upvotes: 2
