Srinivas Bandaru

Reputation: 311

Unable to directly load Hive Parquet table using Spark DataFrame

I have gone through the related posts on SO and couldn't find this specific issue anywhere on the internet.

I am trying to load a Hive table (an external table pointing to Parquet files), but the Spark DataFrame reads only the schema, not the data. The same table queries fine from the Hive shell; loading it into a DataFrame returns no rows. Below are my script and the DDL. I am using Spark 2.1 (MapR distribution).

Unable to read data from Spark for a Hive table with underlying Parquet files:

scala> val df4 = spark.sql("select * from default.Tablename")
scala> df4.show()
+----+----+----+---+----+-------------+---------+
|col1|col2|col3|key|col4|record_status|source_cd|
+----+----+----+---+----+-------------+---------+
+----+----+----+---+----+-------------+---------+
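As a sanity check, one can read the Parquet files directly from the table's LOCATION, bypassing the Hive metastore entirely (a minimal sketch; the path is taken from the DDL below):

// Read the Parquet files straight from the table's LOCATION,
// skipping the metastore lookup
val dfDirect = spark.read.parquet("maprfs:/Datalocation/Tablename")
dfDirect.show()

If this shows rows while the spark.sql query above returns none, the files themselves are fine and the problem lies in how Spark resolves the table metadata.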

Hive DDL
CREATE EXTERNAL TABLE `Tablename`(
  `col1` string,
  `col2` string,
  `col3` decimal(19,0),
  `key` string,
  `col6` string,
  `record_status` string,
  `source_cd` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'path'='maprfs:abc/bds/dbname.db/Tablename')
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'maprfs:/Datalocation/Tablename'
TBLPROPERTIES (
  'numFiles'='2',
  'spark.sql.sources.provider'='parquet',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"col1\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"col2\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"col3\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"key\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"col6\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"record_status\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"source_cd\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}',
  'totalSize'='68216',
  'transient_lastDdlTime'='1502904476')
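For diagnosis, a quick way to see how Spark itself resolves the table is DESCRIBE FORMATTED (a sketch; the exact output layout varies by Spark version, and that the displayed path comes from the 'path' SerDe property rather than the Hive LOCATION is an assumption about datasource-table handling):

// Dump Spark's view of the table, including its storage properties
spark.sql("DESCRIBE FORMATTED default.Tablename").show(100, false)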

Upvotes: 2

Views: 2231

Answers (1)

user1997656

Reputation: 532

Remove 'spark.sql.sources.provider'='parquet' from the TBLPROPERTIES and it will work. With that property set, Spark treats this as one of its own datasource tables and most likely resolves the data from the 'path' SerDe property ('maprfs:abc/bds/dbname.db/Tablename') instead of the Hive LOCATION ('maprfs:/Datalocation/Tablename'), so it scans the wrong location and returns no rows.
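For reference, a minimal sketch of clearing the property in place rather than recreating the table (the table name matches the question; also dropping the stale spark.sql.sources.schema.* entries is an assumption, since they carry the old Spark-side view of the schema):

// Drop the Spark datasource properties so Spark falls back to reading
// the table as a plain Hive Parquet table at its LOCATION
spark.sql("""ALTER TABLE default.Tablename UNSET TBLPROPERTIES (
  'spark.sql.sources.provider',
  'spark.sql.sources.schema.numParts',
  'spark.sql.sources.schema.part.0')""")
// Invalidate Spark's cached metadata for the table
spark.sql("REFRESH TABLE default.Tablename")

The same ALTER TABLE ... UNSET TBLPROPERTIES statement can also be run from the Hive shell.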

Upvotes: 1
