Reputation: 5892
I have the following folder structure in hdfs
/input/data/yyyy/mm/dd/
and inside it data files, for example:
/input/data/2013/05/01/
file_2013_05_01_01.json // file format yyyy_mm_dd_hh
file_2013_05_01_02.json // file format yyyy_mm_dd_hh
....
I've defined hive external table for this folder:
CREATE EXTERNAL TABLE input_data (
vr INT, ....
)
PARTITIONED BY (tsp STRING)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
STORED AS TEXTFILE;
adding for each folder a partition as following:
alter table input_data ADD PARTITION (tsp="2013-05-01") LOCATION '/input/data/2013/05/01/';
The following query will take as input all files in date 2013-05-01
select ... from input_data where tps="2013-05-01"
How can I take only files of specific hour? without changing the hdfs structure to put each hour in separate folder?
Upvotes: 5
Views: 4279
Reputation: 12010
You could make use of the following construct:
SELECT
*
FROM
my_input_data
WHERE
INPUT__FILE__NAME LIKE '%hh.json';
Here hh
is your desired hour and INPUT__FILE__NAME
is the virtual column available to hive queries while processing a given file.
Upvotes: 3
Reputation: 34184
You could make use of a virtual column called INPUT__FILE__NAME
. It is one of the 2 two virtual columns provided by Hive 0.8.0 and onward and represents the input file's name for a mapper task. So you could do something like this :
select ... from input_data
where tps="2013-05-01"
and INPUT__FILE__NAME='file_2013_05_01_01.json';
HTH
Upvotes: 7