Chris Townsend

Reputation: 3162

How to show column data type of a parquet file with Apache Drill?

I'm trying to compare two sets of Parquet files. One set was created with Apache Drill and the other with Apache Spark. The Drill-created set has known types because the conversion uses CREATE TABLE AS and explicitly casts each column. The Spark-created set was written by simply saving the RDD to Parquet, and it is much larger. I'd like to get the column types from the Parquet files created by Spark, but I can't query their schema with Drill.

All the parquet files were moved into or created in /tmp

I've tried things like this:

use dfs.tmp; 
SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'tweet' AND TABLE_SCHEMA = 'dfs.tmp';

The tables don't show up this way, but they do show up when I issue a SHOW FILES command. My understanding of the documentation is that this is expected, but I don't see how I can view the data types of the Parquet files.

Upvotes: 1

Views: 5426

Answers (1)

Arina Ielchiieva

Reputation: 671

Currently, INFORMATION_SCHEMA can show the data types of views and tables, but not of file-based data sources. From the documentation:

The TABLES table returns the table name and type for each table or view in your databases. (Type means TABLE or VIEW.) Note that Drill does not return files available for querying in file-based data sources. Instead, use SHOW FILES to explore these data sources.

To compare types, you can apply the typeof function to each column (select typeof(col1), ... from t), or use parquet-tools to inspect the Parquet files.
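As a minimal sketch of the typeof approach, assuming the Spark output sits in a directory named tweet under the dfs.tmp workspace (as in the question) and that it has columns named id and text (hypothetical names; substitute your own):

```sql
-- Column names `id` and `text` are placeholders, not taken from the real files.
-- typeof is evaluated per row, so LIMIT 1 is enough for a quick look.
SELECT
  typeof(`id`)   AS id_type,
  typeof(`text`) AS text_type
FROM dfs.tmp.`tweet`
LIMIT 1;
```

Running the same query against the Drill-created files should give a side-by-side view of the types. If a column might hold mixed types across files, replacing LIMIT 1 with SELECT DISTINCT typeof(...) on that column would show every type that occurs, at the cost of a full scan.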

Upvotes: 2
