how to make hive take only specific files as input from hdfs folder

Question

I have the following folder structure in hdfs

  /input/data/yyyy/mm/dd/

and inside it data files, for example:

/input/data/2013/05/01/
      file_2013_05_01_01.json // file format yyyy_mm_dd_hh
      file_2013_05_01_02.json // file format yyyy_mm_dd_hh
      ....

I've defined hive external table for this folder:

CREATE EXTERNAL TABLE input_data (
    vr INT, ....
)
PARTITIONED BY (tsp STRING)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
STORED AS TEXTFILE;

adding for each folder a partition as following:

   alter table input_data ADD PARTITION (tsp="2013-05-01") LOCATION '/input/data/2013/05/01/';

The following query will take as input all files in date 2013-05-01

select ... from input_data where tps="2013-05-01"

How can I take only files of specific hour? without changing the hdfs structure to put each hour in separate folder?

Tariq · Accepted Answer

You could make use of a virtual column called INPUT__FILE__NAME. It is one of the 2 two virtual columns provided by Hive 0.8.0 and onward and represents the input file's name for a mapper task. So you could do something like this :

select ... from input_data 
where tps="2013-05-01" 
and INPUT__FILE__NAME='file_2013_05_01_01.json';

HTH

how to make hive take only specific files as input from hdfs folder

Answers (2)

Related Questions