Reputation: 282
I'm trying to a LEFT OUTER JOIN operation on 2 of my tables in hive. Could understand that we have include filter conditions along with the join conditions in case of joins, imitting them from where conditions to avoid full table scans. Reference: https://gist.github.com/randyzwitch/9abeb66d8637d1a0007c
Inspite of doing this, my query is resulting in a high number of mappers and reducers as if it is doing a full table scan.
Here is my query and explain plan. I'm not good at understanding this explain plan. m.date_id
and d.REC_CREATED_DATE
are the partitioned columns in the respective tables so it should actually scan only these partitions.
Any suggestions to improve my query would be of great help.
hive> EXPLAIN SELECT m.execution_id
> ,m.operation_name
> ,m.return_code
> ,m.explanation
> ,d.REC_CREATED_DATE
> FROM web_log_master m LEFT OUTER JOIN web_log_detail d
> on (m.execution_id = d.execution_id AND m.date_id='2015-07-14' and d.REC_CREATED_DATE='2015-07-14') ;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF (TOK_TABNAME web_log_master) m) (TOK_TABREF (TOK_TABNAME web_log_detail) d) (and (AND (= (. (TOK_TABLE_OR_COL m) execution_id) (. (TOK_TABLE_OR_COL d) execution_id)) (= (. (TOK_TABLE_OR_COL m) date_id) '2015-07-14')) (= (. (TOK_TABLE_OR_COL d) REC_CREATED_DATE) '2015-07-14')))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL m) execution_id)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL m) operation_name)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL m) return_code)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL m) explanation)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL d) REC_CREATED_DATE)))))
STAGE DEPENDENCIES:
Stage-4 is a root stage , consists of Stage-1
Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-4
Conditional Operator
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
d
TableScan
alias: d
Reduce Output Operator
key expressions:
expr: execution_id
type: string
sort order: +
Map-reduce partition columns:
expr: execution_id
type: string
tag: 1
value expressions:
expr: rec_created_date
type: string
m
TableScan
alias: m
Reduce Output Operator
key expressions:
expr: execution_id
type: string
sort order: +
Map-reduce partition columns:
expr: execution_id
type: string
tag: 0
value expressions:
expr: execution_id
type: string
expr: operation_name
type: string
expr: return_code
type: string
expr: explanation
type: string
expr: date_id
type: string
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1} {VALUE._col2} {VALUE._col3}
1 {VALUE._col3}
filter predicates:
0 {(VALUE._col13 = '2015-07-14')}
1
handleSkewJoin: false
outputColumnNames: _col0, _col1, _col2, _col3, _col19
Select Operator
expressions:
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: string
expr: _col3
type: string
expr: _col19
type: string
outputColumnNames: _col0, _col1, _col2, _col3, _col4
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
Time taken: 13.616 seconds, Fetched: 90 row(s)
Upvotes: 0
Views: 1263
Reputation: 3897
The number of mappers and reducers is dependent on if the job is parallelizable and the capacity of your cluster. If you have many machines you get more mappers and reducers. If you have less machines you get fewer. If the job is not parallelizable then you will get one reducer as in the case here:
select count(distinct column) from x;
Which requires a single reducer when written this way.
In fact you want many mappers and reducers to be working. This is how map reduce scales. Many hands make light work so to speak. In any event your left outer join is working as expected.
Upvotes: 1