Reputation: 99
I am generating Parquet files (partitioned by SetId, using Snappy compression) with Spark and storing them at an HDFS location.
df.coalesce(1).write
  .partitionBy("SetId")
  .mode(SaveMode.Overwrite)
  .format("parquet")
  .option("header", "true")
  .save(args(1))
The Parquet data files are stored under /some-hdfs-path/testsp
I then create the Hive table for it as follows:
CREATE EXTERNAL TABLE DimCompany(
  CompanyCode string,
  CompanyShortName string,
  CompanyDescription string,
  BusinessDate string,
  PeriodTypeInd string,
  IrisDuplicateFlag int,
  GenTimestamp timestamp
)
PARTITIONED BY (SetId int)
STORED AS PARQUET
LOCATION '/some-hdfs-path/testsp'
TBLPROPERTIES ('skip.header.line.count'='1', 'parquet.compress'='snappy');
However, when I run a SELECT on the table in Hive, it returns no results.
I tried:
running the MSCK repair command:
msck repair table dimcompany;
setting the following:
spark.sql("SET spark.sql.hive.convertMetastoreParquet=false")
Neither of these works. How can I solve this?
Upvotes: 4
Views: 1854
Reputation: 158
The issue is that your partition column, SetId, uses upper-case letters.
Hive converts column names to lowercase, so your partition column is stored in the metastore as setid instead of SetId. When Hive then searches for partition folders in the case-sensitive HDFS file system, it looks for setid=some_value and finds nothing, because your data folders have the form SetId=some_value.
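For example, the write above produces partition folders of the first form below, while Hive scans for the second (paths illustrative, based on the location in the question):
/some-hdfs-path/testsp/SetId=1/   <-- what Spark wrote
/some-hdfs-path/testsp/setid=1/   <-- what Hive looks for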
To make this work, write the partition column in lower case or snake_case. You can do this by aliasing the column in your DataFrame:
import org.apache.spark.sql.functions.col

df.select(
  // ... your other columns ...
  col("SetId").alias("set_id")
)
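If you'd rather not enumerate the remaining columns, Spark's withColumnRenamed does the same rename in one call (a sketch, reusing the df and write logic from the question):
df.withColumnRenamed("SetId", "set_id")
  .coalesce(1).write
  .partitionBy("set_id")
  .mode(SaveMode.Overwrite)
  .format("parquet")
  .save(args(1))
Remember to change the DDL to match, i.e. PARTITIONED BY (set_id int).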
You may also have to set these properties before executing the CREATE statement, based on this related Stack Overflow post:
SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;
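If you issue the CREATE statement through Spark rather than the Hive CLI, the same settings can be applied there as well (a sketch, assuming the SparkSession is available as spark, as in the question):
spark.sql("SET hive.mapred.supports.subdirectories=true")
spark.sql("SET mapred.input.dir.recursive=true")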
After you create your table, also try running
msck repair table <your_schema.your_table>;
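Once the repair succeeds, you can verify that the partitions were registered:
show partitions <your_schema.your_table>;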
Upvotes: 1