Seb

Reputation: 378

Impala table from spark partitioned parquet files

I have generated some partitioned parquet data using Spark, and I'm wondering how to map it to an Impala table... Sadly, I haven't found any solution yet.

The schema of the parquet data is like:

{ key: long,
value: string,
date: long }

and I partitioned it by key and date, which gives me this kind of directory layout on my HDFS:

/data/key=1/date=20170101/files.parquet
/data/key=1/date=20170102/files.parquet
/data/key=2/date=20170101/files.parquet
/data/key=2/date=20170102/files.parquet
...

Do you know how I could tell Impala to create a table from this dataset with the corresponding partitions (and without having to loop over each partition, as I have read suggested elsewhere)? Is it possible?

Thank you in advance

Upvotes: 3

Views: 2181

Answers (1)

kartik

Reputation: 168

Assuming that by "schema of parquet" you mean the schema of the dataset, and that you then partitioned it by the key and date columns, the actual files.parquet files will contain only the value column (the partition columns live in the directory names, not in the files). Now you can proceed as follows.

The solution is to use an Impala external table.

create external table mytable (value STRING)
partitioned by (key BIGINT, `date` BIGINT)
stored as parquet
location '....../data/';

Note that in the above statement you have to give the path up to the data folder (`date` is back-quoted because it can clash with an Impala reserved word).

alter table mytable recover partitions;

refresh mytable;

The above two commands will automatically detect the partitions based on the partition columns declared on the table and pick up the parquet files present in the subdirectories.
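If you want to verify what Impala picked up, you can list the discovered partitions (a quick sanity check, using the table name from the statement above):

show partitions mytable;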

Now, you can start querying the data.
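For example, a query restricted to one of the partition directories shown in the question could look like this (the literal values are only illustrative):

select key, `date`, value from mytable where key = 1 and `date` = 20170101;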

Hope it helps

Upvotes: 3
