Reputation: 875

automatically partition Hive tables based on S3 directory names

I have data stored in S3 like:

/bucket/date=20140701/file1
/bucket/date=20140701/file2
...
/bucket/date=20140701/fileN

/bucket/date=20140702/file1
/bucket/date=20140702/file2
...
/bucket/date=20140702/fileN
...

My understanding is that if I pull in that data via Hive, it will automatically interpret date as a partition. My table creation looks like:

CREATE EXTERNAL TABLE search_input(
   col 1 STRING,
   col 2 STRING,
   ...

)
PARTITIONED BY(date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/';

However Hive doesn't recognize any data. Any queries I run return with 0 results. If I instead just grab one of the dates via:

CREATE EXTERNAL TABLE search_input_20140701(
   col 1 STRING,
   col 2 STRING,
   ...

)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/date=20140701';

I can query data just fine.

Why doesn't Hive recognize the nested directories with the "date=date_str" partition? Is there a better way to have Hive run a query over multiple sub-directories and slice it based on a datetime string?

Upvotes: 6

Answers (2)

Nikhil Patil

Reputation: 21

I had faced the same issue and realized that hive does not have partitions metadata with it. So we need to add that metadata using ALTER TABLE ADD PARTITION query. It becomes tedious, if you have few hundred partitions to create same queries with different values.

ALTER TABLE <table name> ADD PARTITION(<partitioned column name>=<partition value>);

Once you run above query for all available partitions. You should see the results in hive queries.

Upvotes: 0

gallamine

Reputation: 875

In order to get this to work I had to do 2 things:

Enable recursive directory support:

SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;

For some reason it would still not recognize my partitions so I had to recover them via:

ALTER TABLE search_input RECOVER PARTITIONS;

You can use:

SHOW PARTITIONS table;

to check and see that they've been recovered.

Upvotes: 6

automatically partition Hive tables based on S3 directory names

Answers (2)

Related Questions