user1954337
user1954337

Reputation: 51

which sql is better on Hbase with phoenix?

I make two tables by phoenix on Hbase.

One is ORIGIN_LOG, the other ORIGIN_LOG_INDEX.

In ORIGIN_LOG, the key is info_key. In ORIGIN_LOG_INDEX, the key is (log_t, zone)

And we save log_t, zone, info_key in ORIGIN_LOG_INDEX, so that we can search info_key very quickly by log_t and zone in ORIGIN_LOG_INDEX. Then using info_key, we can get detail log info from ORIGIN_LOG by info_key because info_key is the key for ORIGIN_LOG.

But when we explain the following sql. we find it will cost full scan on ORIGIN_LOG.

explain select "log_t", "app_ver", "device_id", "mobage_uid",     "param1","param2","param3", "param4" , "param5", "user_id", "a_typ", "a_tar", "a_rst"  from "ORIGIN_LOG" where "info_key" in (select distinct "info_key" from "ORIGIN_LOG_INDEX" where  "log_t">='1423956600' and  "log_t"<'1423956601' and  "zone" ='18')



    CLIENT 4-CHUNK PARALLEL 4-WAY FULL SCAN OVER ORIGIN_LOG 
    CLIENT MERGE SORT                        |
    |     SKIP-SCAN-JOIN TABLE 0               |
    |         CLIENT 2-CHUNK PARALLEL 2-WAY SKIP SCAN ON 2 RANGES OVER         
    ORIGIN_LOG_INDEX [0,'1423956600','18'] - [1,'1423956601','18'] |
    |             SERVER FILTER BY FIRST KEY ONLY |
    |             SERVER AGGREGATE INTO DISTINCT ROWS BY [info_key] |
    |         CLIENT MERGE SORT                |
    |     DYNAMIC SERVER FILTER BY info_key IN ($5.$7) |

If we only use ORIGIN_LOG with conditions log_t and zone, as following:

select "log_t", "app_ver", "device_id", "mobage_uid", "param1","param2","param3", "param4" , "param5", "user_id", "a_typ", "a_tar", "a_rst"  from "ORIGIN_LOG"  where  "log_t">='1423956600' and  "log_t"<'1423956601' and  "zone" ='18';

We also get full scan.

CLIENT 4-CHUNK PARALLEL 4-WAY FULL SCAN OVER ORIGIN_LOG |
|     SERVER FILTER BY (log_t >= '1423956600' AND log_t < '1423956601' AND  zone = '18') |
| CLIENT MERGE SORT                        |

So what the difference between two sql. And which sql is better for the performance.

Thank you.

BRs

Upvotes: 2

Views: 699

Answers (1)

Anil Gupta
Anil Gupta

Reputation: 1126

Your first query is range base scan of HBase on ORIGIN_LOG_INDEX and then Gets on ORIGIN_LOG. Your second query is a range based scan in HBase where you would provide a "startkey" and "endkey" for scan. Second query is much better because you are avoiding lookup into another table and you are also not doing distinct operation.
However, it is possible that startKey and endkey range might span entire table. So, the worst case of your scan is "FULL TABLE" scan. Hence, i think, the explain plan is showing it as a full table scan.Maybe, you can ask on the mailing list for further clarification.

Upvotes: 2

Related Questions