Reputation: 969
I’m creating a external table using Serde
org.apache.hive.hcatalog.data.JsonSerde
Hdfs folder location has multiple file formats and I would like to import only json files types . I tried to use
**SERDEPROPERTIES (“input.regex” = “*.json”)**
But does not seem to work
Upvotes: 1
Views: 202
Reputation: 38335
Hive has virtual column named INPUT__FILE__NAME and you can filter it in the where clause.
select * from --do everything else on this level
(select * from your_table --do filtering in the subquery wrapper
where INPUT__FILE__NAME rlike '\\.json$' --ends with .json
)s
I'm not sure it will efficiently work because these related Jiras unfortunately are not implemented yet: Add file pruning into Hive and Selectively include EXTERNAL TABLE source files via REGEX
If this solution will work slow because mappers will read files, you may need to copy files required to the separate folder and build a table on top of it.
Have found one more solution which you can possibly use: SymlinkTextInputFormat
Create some new location with a file containing a list of all files required and build a table on top of this location.
For example, the location of the table is "/user/hive/mytable"
.
There is a file called "/user/hive/mytable/myfile.txt".
Inside the file, there are 2 lines, "/user/myname/textfile1.txt"
and "/user/myname/textfile2.txt"
We can do:
CREATE TABLE mytable (...) STORED AS INPUTFORMAT 'org.apache.hadoop.hive.io.SymlinkTextInputFormat' LOCATION '/user/hive/mytable';
SELECT * FROM mytable;
which will return the content of the 2 files: "/user/myname/textfile1.txt" and "/user/myname/textfile2.txt"
Upvotes: 0