MikeKulls
MikeKulls

Reputation: 996

Hadoop Hive query optimisation

I have a table that is partitioned by date in the format yyyyMMdd. If I do a simple query like this:

SELECT COUNT(*) FROM MyTable WHERE Date >= '20140924'

then it will scan 3 days of data (today is the 26th). However I would like my query to always look at the last 3 days, so I write it like this

SELECT COUNT(*) FROM MyTable
WHERE date >= from_unixtime(unix_timestamp() - 259200, 'yyyyMMdd')

The problem is that now it scans every partition. Is there a way to get it to pre-calculate the part of the query after the greater than?

Upvotes: 2

Views: 169

Answers (2)

FuzzyTree
FuzzyTree

Reputation: 32392

One workaround is to create a table with a single row where you can select your calculated time from.

CREATE TABLE dual (dummy STRING);
INSERT INTO TABLE dual SELECT count(*) FROM dual;

SELECT COUNT(*) FROM MyTable t1
JOIN (
    SELECT from_unixtime(unix_timestamp() - 259200, 'yyyyMMdd') myDate
    FROM dual LIMIT 1
) t2 ON (t2.myDate = t1.Date)

Upvotes: 0

Amar
Amar

Reputation: 3845

Unfortunately Hive doesn't support that. I have faced the similar problem in past and my hive table had partitions for last 2 years.

However one workaround you can do is that you can run this hive query inside a shell script where you will compute this date in another variable and use it as a variable in hive query. An example script would be:

#!/bin/bash

date=`date +"%Y%m%d" -d "-3 days"`
hive -e "select count(*) from MyTable where date >= '$date'"

Upvotes: 1

Related Questions