Reputation: 3124
Having a data frame store in parquet format and partitioned physically over different files based on values from one or more columns in a filesystem structure like the following:
|-region=Europe
| |-processed=1
| | |-part-r-00000-0515bd21-1cf1-40f1-885e-055a29284961.gz.parquet
| |-processed=2
| | |-part-r-00000-0515bd21-1cf1-40f1-885e-055a29284961.gz.parquet
| |-processed=3
| | |-part-r-00000-0515bd21-1cf1-40f1-885e-055a29284961.gz.parquet
| |-processed=4
| |-part-r-00000-0515bd21-1cf1-40f1-885e-055a29284961.gz.parquet
|-region=Asia
| |-processed=2
| | |-part-r-00000-0515bd21-1cf1-40f1-885e-055a29284961.gz.parquet
| |-processed=4
| |-part-r-00000-0515bd21-1cf1-40f1-885e-055a29284961.gz.parquet
|-region=America
| |-processed=3
| |-part-r-00000-0515bd21-1cf1-40f1-885e-055a29284961.gz.parquet
|-_SUCCESS
To make the schema aware of the partitions I'm adding each partition manually:
ALTER TABLE status_log ADD PARTITION (region='Europe', processed="1") LOCATION '/hdfs/status_logs/region=Europe/processed=1';
ALTER TABLE status_log ADD PARTITION (region='Europe', processed="2") LOCATION '/hdfs/status_logs/region=Europe/processed=2';
ALTER TABLE status_log ADD PARTITION (region='Europe', processed="4") LOCATION '/hdfs/status_logs/region=Europe/processed=2';
...
..
.
Is there way to define the table schema such that Impala it will read the tree directory structure and automatically infer the partitioned schema? Or does one always need to add each partition to the table automatically?
Upvotes: 2
Views: 1250
Reputation: 421
you could use hive, since impala shares the same metastore this should not be a problem. I assume that in the definition of the talbe status_logs ,the LOCATION is set as '/hdfs/status_logs'.
in hive issue command
msck repair table status_logs
this will automatically add all the partitions to the metastore. and then back in impala , all you need to do is
invalidate metadata status_logs
you could see all the partitions. to verify you could do (this you probably already know).
show partitions status_logs
Upvotes: 2