David Batista
David Batista

Reputation: 3124

Defining an Impala table schema for partitioned data

Having a data frame store in parquet format and partitioned physically over different files based on values from one or more columns in a filesystem structure like the following:

|-region=Europe
|   |-processed=1
|   |   |-part-r-00000-0515bd21-1cf1-40f1-885e-055a29284961.gz.parquet
|   |-processed=2
|   |   |-part-r-00000-0515bd21-1cf1-40f1-885e-055a29284961.gz.parquet
|   |-processed=3
|   |   |-part-r-00000-0515bd21-1cf1-40f1-885e-055a29284961.gz.parquet
|   |-processed=4
|       |-part-r-00000-0515bd21-1cf1-40f1-885e-055a29284961.gz.parquet
|-region=Asia
|   |-processed=2
|   |   |-part-r-00000-0515bd21-1cf1-40f1-885e-055a29284961.gz.parquet
|   |-processed=4
|       |-part-r-00000-0515bd21-1cf1-40f1-885e-055a29284961.gz.parquet
|-region=America
|   |-processed=3
|       |-part-r-00000-0515bd21-1cf1-40f1-885e-055a29284961.gz.parquet
|-_SUCCESS

To make the schema aware of the partitions I'm adding each partition manually:

ALTER TABLE status_log ADD PARTITION (region='Europe', processed="1") LOCATION '/hdfs/status_logs/region=Europe/processed=1';
ALTER TABLE status_log ADD PARTITION (region='Europe', processed="2") LOCATION '/hdfs/status_logs/region=Europe/processed=2';
ALTER TABLE status_log ADD PARTITION (region='Europe', processed="4") LOCATION '/hdfs/status_logs/region=Europe/processed=2';
...
..
.

Is there way to define the table schema such that Impala it will read the tree directory structure and automatically infer the partitioned schema? Or does one always need to add each partition to the table automatically?

Upvotes: 2

Views: 1250

Answers (2)

b1n0ys
b1n0ys

Reputation: 421

you could use hive, since impala shares the same metastore this should not be a problem. I assume that in the definition of the talbe status_logs ,the LOCATION is set as '/hdfs/status_logs'.

in hive issue command

msck repair table status_logs

this will automatically add all the partitions to the metastore. and then back in impala , all you need to do is

invalidate metadata status_logs

you could see all the partitions. to verify you could do (this you probably already know).

show partitions status_logs

Upvotes: 2

Related Questions