Gaurav

Reputation: 99

Hive external tables not able to see partitioned Parquet files

I am generating Parquet files (partitioned by SetId, using Snappy compression) with Spark and storing them at an HDFS location.

df.coalesce(1).write.partitionBy("SetId").
  mode(SaveMode.Overwrite).
  format("parquet").
  option("header","true").
  save(args(1))

The Parquet data files are stored under /some-hdfs-path/testsp

I then create the Hive table for it as follows:

CREATE EXTERNAL TABLE DimCompany(
  CompanyCode string,
  CompanyShortName string,
  CompanyDescription string,
  BusinessDate string,
  PeriodTypeInd string,
  IrisDuplicateFlag int,
  GenTimestamp timestamp
) partitioned by (SetId int)
STORED AS PARQUET LOCATION '/some-hdfs-path/testsp'
TBLPROPERTIES ('skip.header.line.count'='1','parquet.compress'='snappy');

However, when I select from the table in Hive, it doesn't return any results.

I tried:

  1. running the MSCK repair command:

    msck repair table dimcompany;
    
  2. setting the following:

    spark.sql("SET spark.sql.hive.convertMetastoreParquet=false")
    

None of these worked. How can I solve this?

Upvotes: 4

Views: 1854

Answers (1)

schwadan

Reputation: 158

The issue is that your partition column, SetId, uses upper-case letters.

Since Hive converts its column names into lowercase, your partition column is stored as setid instead of SetId. Thus, when Hive searches for partitions/folders in a case-sensitive data store, it looks for setid=some_value and finds nothing, since your data folders are of the format SetId=some_value.
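
One quick way to see the mismatch is to compare what the metastore has registered with what Spark actually wrote to HDFS. A minimal sketch, assuming a Hive-enabled SparkSession named spark and the table/path names from the question:

import org.apache.hadoop.fs.{FileSystem, Path}

// Partitions registered in the Hive metastore (empty here, since MSCK
// could not match the upper-case folder names)
spark.sql("SHOW PARTITIONS dimcompany").show(false)

// Directories Spark actually created (named SetId=<value>)
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path("/some-hdfs-path/testsp"))
  .foreach(status => println(status.getPath.getName))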

To make this work, convert SetId to lower case or snake_case. You can do this by aliasing the column in your DataFrame:

import org.apache.spark.sql.functions.col

df.select(
  // ... your other columns ...
  col("SetId").alias("set_id")
)
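
With the renamed column, the write itself stays the same; the partition folders just come out as set_id=<value>. A rough sketch, reusing the save path from the question (the rename is the only change from your original code):

import org.apache.spark.sql.SaveMode

df.withColumnRenamed("SetId", "set_id")
  .coalesce(1)
  .write
  .partitionBy("set_id")            // folders become set_id=<value>
  .mode(SaveMode.Overwrite)
  .format("parquet")
  .save("/some-hdfs-path/testsp")

Remember to use the same lowercase name in the Hive DDL as well: partitioned by (set_id int).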

You may also have to set these properties before executing the CREATE statement, based on this StackOverflow post:

SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;

After you create your table, also try running:

msck repair table <your_schema.your_table>;
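
If you are driving everything from Spark rather than the Hive CLI, the same statements can be issued through spark.sql (a sketch, assuming a Hive-enabled SparkSession and the table name from the question):

spark.sql("SET hive.mapred.supports.subdirectories=true")
spark.sql("SET mapred.input.dir.recursive=true")
spark.sql("MSCK REPAIR TABLE dimcompany")
spark.sql("SELECT * FROM dimcompany LIMIT 10").show(false)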

Upvotes: 1
