RData
RData

Reputation: 969

Hive import only certain file types from hdfs

I’m creating a external table using Serde

org.apache.hive.hcatalog.data.JsonSerde

Hdfs folder location has multiple file formats and I would like to import only json files types . I tried to use

**SERDEPROPERTIES (“input.regex” = “*.json”)** 

But does not seem to work

Upvotes: 1

Views: 202

Answers (1)

leftjoin
leftjoin

Reputation: 38335

Hive has virtual column named INPUT__FILE__NAME and you can filter it in the where clause.

select * from --do everything else on this level
(select * from your_table --do filtering in the subquery wrapper
where INPUT__FILE__NAME rlike '\\.json$' --ends with .json
)s

I'm not sure it will efficiently work because these related Jiras unfortunately are not implemented yet: Add file pruning into Hive and Selectively include EXTERNAL TABLE source files via REGEX

If this solution will work slow because mappers will read files, you may need to copy files required to the separate folder and build a table on top of it.

Have found one more solution which you can possibly use: SymlinkTextInputFormat

Create some new location with a file containing a list of all files required and build a table on top of this location.

For example, the location of the table is "/user/hive/mytable". There is a file called "/user/hive/mytable/myfile.txt". Inside the file, there are 2 lines, "/user/myname/textfile1.txt" and "/user/myname/textfile2.txt"

We can do:

CREATE TABLE mytable (...) STORED AS INPUTFORMAT 'org.apache.hadoop.hive.io.SymlinkTextInputFormat' LOCATION '/user/hive/mytable';
SELECT * FROM mytable;

which will return the content of the 2 files: "/user/myname/textfile1.txt" and "/user/myname/textfile2.txt"

Upvotes: 0

Related Questions