ic10503

Reputation: 130

Create hive table with partitions from a Scala dataframe

I need a way to create a hive table from a Scala dataframe. The hive table should have underlying files in ORC format in S3 location partitioned by date.

Here is what I have got so far:

I write the Scala dataframe to S3 in ORC format:

df.write.format("orc").partitionBy("date").save("S3Location")

I can see the ORC files in the S3 location. Now I create a Hive table on top of these ORC files:

CREATE EXTERNAL TABLE tableName(columnName string)
PARTITIONED BY (date string)
STORED AS ORC
LOCATION 'S3Location'
TBLPROPERTIES ("orc.compress"="SNAPPY")

But the Hive table is empty, i.e. spark.sql("select * from db.tableName") returns no results.

However, when I remove the PARTITIONED BY line:

CREATE EXTERNAL TABLE tableName(columnName string, date string)
STORED AS ORC
LOCATION 'S3Location'
TBLPROPERTIES ("orc.compress"="SNAPPY")

I see results from the select query.

It seems that Hive does not recognize the partitions created by Spark. I am using Spark 2.2.0.

Any suggestions will be appreciated.

Update:

I am starting with a Spark dataframe and I just need a way to create a Hive table on top of it (with the underlying files in ORC format in an S3 location).

Upvotes: 0

Views: 1119

Answers (1)

54l3d

Reputation: 3973

I think the partitions have not been added to the Hive metastore yet, so you only need to run this Hive command:

MSCK REPAIR TABLE table_name
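This can also be run directly from Spark; a minimal sketch, assuming the table from the question lives in a database named db (both names are placeholders):

```scala
// Sketch: ask Hive to scan the table's S3 location and register any
// partition directories (e.g. date=2020-10-30) it finds there.
// "db.tableName" is a placeholder for the actual database and table.
spark.sql("MSCK REPAIR TABLE db.tableName")
```

After this runs, the select query from the question should start returning rows.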

If that does not work, you may need to check these points:

  • After writing the data to S3, the folder structure should look like: s3://anypathyouwant/mytablefolder/transaction_date=2020-10-30
  • When creating the external table, the location should point to s3://anypathyouwant/mytablefolder

And yes, Spark writes the data to S3 but does not add the partition definitions to the Hive metastore! Hive is not aware of written data unless it is under a recognized partition. To check what partitions are in the Hive metastore, you can use this Hive command:

SHOW PARTITIONS tablename

In a production environment, I do not recommend using MSCK REPAIR TABLE for this purpose, because it rescans the whole table location and becomes more and more time-consuming as partitions accumulate. The best way is to make your code add only the newly created partitions to your metastore, e.g. through an API.
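For example, right after each Spark write, the single new partition can be registered explicitly instead of rescanning everything; a sketch with placeholder table name, partition value, and S3 path:

```scala
// Sketch: register only the partition that was just written, rather
// than repairing the whole table. All names and paths are placeholders.
spark.sql("""
  ALTER TABLE db.tableName ADD IF NOT EXISTS
  PARTITION (date = '2020-10-30')
  LOCATION 's3://anypathyouwant/mytablefolder/date=2020-10-30'
""")
```

IF NOT EXISTS makes the statement safe to re-run if the same date is written again.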

Upvotes: 1
