Reputation: 245
I am finding it difficult to load parquet files into Hive tables. I am working on an Amazon EMR cluster with Spark for data processing, and I need to read the output parquet files to validate my transformations. I have parquet files with the following schema:
root
|-- ATTR_YEAR: long (nullable = true)
|-- afil: struct (nullable = true)
| |-- clm: struct (nullable = true)
| | |-- amb: struct (nullable = true)
| | | |-- L: string (nullable = true)
| | | |-- cdTransRsn: string (nullable = true)
| | | |-- dist: struct (nullable = true)
| | | | |-- T: string (nullable = true)
| | | | |-- content: double (nullable = true)
| | | |-- dscStrchPurp: string (nullable = true)
| | |-- amt: struct (nullable = true)
| | | |-- L: string (nullable = true)
| | | |-- T: string (nullable = true)
| | | |-- content: double (nullable = true)
| | |-- amtTotChrg: double (nullable = true)
| | |-- cdAccState: string (nullable = true)
| | |-- cdCause: string (nullable = true)
How can I create a Hive external table using this type of schema and load the parquet files into that Hive table for analysis?
Upvotes: 1
Views: 1011
Reputation: 35249
You can use Catalog.createExternalTable (Spark before 2.2) or Catalog.createTable (Spark 2.2 and later).
The Catalog instance can be accessed through the SparkSession:
val spark: SparkSession
spark.catalog.createTable(...)
The session should be initialized with Hive support enabled.
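As a minimal sketch of how this could look on EMR (the table name and S3 path are placeholders, not values from your job), assuming the parquet files already exist at the given location so Spark can infer the nested schema from the file footers:
import org.apache.spark.sql.SparkSession

// Hive support is required so the table is registered in the Hive metastore.
val spark = SparkSession.builder()
  .appName("parquet-to-hive")
  .enableHiveSupport()
  .getOrCreate()

// Spark 2.2+: register an external table backed by the existing parquet files.
// The schema, including the nested structs, is taken from the parquet files themselves.
spark.catalog.createTable(
  "claims_parquet",                    // hypothetical table name
  "s3://your-bucket/path/to/parquet/", // hypothetical location of your output files
  "parquet"
)

// Nested fields can then be queried with dot notation:
spark.sql("SELECT ATTR_YEAR, afil.clm.amtTotChrg FROM claims_parquet").show()
On Spark before 2.2 the equivalent call would be spark.catalog.createExternalTable with the same arguments. Because the table is external, dropping it later removes only the metastore entry, not the parquet files on S3.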
Upvotes: 0