prashant1988
prashant1988

Reputation: 282

Full table scan issue with LEFT OUTER JOIN in Hive

I'm trying to a LEFT OUTER JOIN operation on 2 of my tables in hive. Could understand that we have include filter conditions along with the join conditions in case of joins, imitting them from where conditions to avoid full table scans. Reference: https://gist.github.com/randyzwitch/9abeb66d8637d1a0007c

Inspite of doing this, my query is resulting in a high number of mappers and reducers as if it is doing a full table scan.

Here is my query and explain plan. I'm not good at understanding this explain plan. m.date_id and d.REC_CREATED_DATE are the partitioned columns in the respective tables so it should actually scan only these partitions.

Any suggestions to improve my query would be of great help.

hive> EXPLAIN SELECT m.execution_id
> ,m.operation_name
> ,m.return_code
> ,m.explanation
> ,d.REC_CREATED_DATE
> FROM web_log_master m  LEFT OUTER JOIN web_log_detail d
> on (m.execution_id = d.execution_id AND m.date_id='2015-07-14' and d.REC_CREATED_DATE='2015-07-14') ;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF (TOK_TABNAME web_log_master) m) (TOK_TABREF (TOK_TABNAME web_log_detail) d) (and (AND (= (. (TOK_TABLE_OR_COL m) execution_id) (. (TOK_TABLE_OR_COL d) execution_id)) (= (. (TOK_TABLE_OR_COL m) date_id) '2015-07-14')) (= (. (TOK_TABLE_OR_COL d) REC_CREATED_DATE) '2015-07-14')))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL m) execution_id)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL m) operation_name)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL m) return_code)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL m) explanation)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL d) REC_CREATED_DATE)))))

STAGE DEPENDENCIES:
  Stage-4 is a root stage , consists of Stage-1
  Stage-1
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-4
  Conditional Operator

  Stage: Stage-1
Map Reduce
  Alias -> Map Operator Tree:
    d
      TableScan
        alias: d
        Reduce Output Operator
          key expressions:
                expr: execution_id
                type: string
          sort order: +
          Map-reduce partition columns:
                expr: execution_id
                type: string
          tag: 1
          value expressions:
                expr: rec_created_date
                type: string
    m
      TableScan
        alias: m
        Reduce Output Operator
          key expressions:
                expr: execution_id
                type: string
          sort order: +
          Map-reduce partition columns:
                expr: execution_id
                type: string
          tag: 0
          value expressions:
                expr: execution_id
                type: string
                expr: operation_name
                type: string
                expr: return_code
                type: string
                expr: explanation
                type: string
                expr: date_id
                type: string
  Reduce Operator Tree:
    Join Operator
      condition map:
           Left Outer Join0 to 1
      condition expressions:
        0 {VALUE._col0} {VALUE._col1} {VALUE._col2} {VALUE._col3}
        1 {VALUE._col3}
      filter predicates:
        0 {(VALUE._col13 = '2015-07-14')}
        1
      handleSkewJoin: false
      outputColumnNames: _col0, _col1, _col2, _col3, _col19
      Select Operator
        expressions:
              expr: _col0
              type: string
              expr: _col1
              type: string
              expr: _col2
              type: string
              expr: _col3
              type: string
              expr: _col19
              type: string
        outputColumnNames: _col0, _col1, _col2, _col3, _col4
        File Output Operator
          compressed: false
          GlobalTableId: 0
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1


Time taken: 13.616 seconds, Fetched: 90 row(s)

Upvotes: 0

Views: 1263

Answers (1)

invoketheshell
invoketheshell

Reputation: 3897

The number of mappers and reducers is dependent on if the job is parallelizable and the capacity of your cluster. If you have many machines you get more mappers and reducers. If you have less machines you get fewer. If the job is not parallelizable then you will get one reducer as in the case here:

select count(distinct column) from x;

Which requires a single reducer when written this way.

In fact you want many mappers and reducers to be working. This is how map reduce scales. Many hands make light work so to speak. In any event your left outer join is working as expected.

Upvotes: 1

Related Questions