Julias
Julias

Reputation: 5892

how to make hive take only specific files as input from hdfs folder

I have the following folder structure in hdfs

  /input/data/yyyy/mm/dd/

and inside it data files, for example:

/input/data/2013/05/01/
      file_2013_05_01_01.json // file format yyyy_mm_dd_hh
      file_2013_05_01_02.json // file format yyyy_mm_dd_hh
      ....

I've defined hive external table for this folder:

CREATE EXTERNAL TABLE input_data (
    vr INT, ....
)
PARTITIONED BY (tsp STRING)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
STORED AS TEXTFILE;

adding for each folder a partition as following:

   alter table input_data ADD PARTITION (tsp="2013-05-01") LOCATION '/input/data/2013/05/01/';

The following query will take as input all files in date 2013-05-01

select ... from input_data where tps="2013-05-01"

How can I take only files of specific hour? without changing the hdfs structure to put each hour in separate folder?

Upvotes: 5

Views: 4279

Answers (2)

Amar
Amar

Reputation: 12010

You could make use of the following construct:

SELECT 
   *
FROM
   my_input_data
WHERE
   INPUT__FILE__NAME LIKE '%hh.json';

Here hh is your desired hour and INPUT__FILE__NAME is the virtual column available to hive queries while processing a given file.

Upvotes: 3

Tariq
Tariq

Reputation: 34184

You could make use of a virtual column called INPUT__FILE__NAME. It is one of the 2 two virtual columns provided by Hive 0.8.0 and onward and represents the input file's name for a mapper task. So you could do something like this :

select ... from input_data 
where tps="2013-05-01" 
and INPUT__FILE__NAME='file_2013_05_01_01.json';

HTH

Upvotes: 7

Related Questions