Hadoop Hive query optimisation

Question

I have a table that is partitioned by date in the format yyyyMMdd. If I do a simple query like this:

SELECT COUNT(*) FROM MyTable WHERE Date >= '20140924'

then it will scan 3 days of data (today is the 26th). However I would like my query to always look at the last 3 days, so I write it like this

SELECT COUNT(*) FROM MyTable
WHERE date >= from_unixtime(unix_timestamp() - 259200, 'yyyyMMdd')

The problem is that now it scans every partition. Is there a way to get it to pre-calculate the part of the query after the greater than?

Amar · Accepted Answer

Unfortunately Hive doesn't support that. I have faced the similar problem in past and my hive table had partitions for last 2 years.

However one workaround you can do is that you can run this hive query inside a shell script where you will compute this date in another variable and use it as a variable in hive query. An example script would be:

#!/bin/bash

date=`date +"%Y%m%d" -d "-3 days"`
hive -e "select count(*) from MyTable where date >= '$date'"

Hadoop Hive query optimisation

Answers (2)

Related Questions