Reputation: 1635
I am trying to insert data into a Hive external table from Spark SQL. I created the Hive external table with the following command:
CREATE EXTERNAL TABLE tab1 (col1 type, col2 type, col3 type) CLUSTERED BY (col1, col2) SORTED BY (col1) INTO 8 BUCKETS STORED AS PARQUET
In my Spark job, I have written the following code:
Dataset<Row> df = session.read().option("header", "true").csv(csvInput);
df.repartition(numBuckets, somecol)
  .write()
  .format("parquet")
  .bucketBy(numBuckets, col1, col2)
  .sortBy(col1)
  .saveAsTable(hiveTableName);
Each time I run this code, I get the following exception:
org.apache.spark.sql.AnalysisException: Table `tab1` already exists.;
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:408)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:393)
at somepackage.Parquet_Read_WriteNew.writeToParquetHiveMetastore(Parquet_Read_WriteNew.java:100)
Upvotes: 2
Views: 13090
Reputation: 4623
You should specify a save mode when saving the data to Hive:
df.write.mode(SaveMode.Append)
  .format("parquet")
  .bucketBy(numBuckets, col1, col2)
  .sortBy(col1)
  .insertInto(hiveTableName);
Spark provides the following save modes:
ErrorIfExists: throws an exception if the target already exists; if the target doesn't exist, writes the data out.
Append: if the target already exists, appends the data to it; otherwise writes the data out.
Overwrite: if the target already exists, deletes it, then writes the data out.
Ignore: if the target already exists, silently skips the write; otherwise writes the data out.
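For example, in the Java API the mode can be set with the SaveMode enum (a minimal sketch; df and tab1 are taken from the question):

import org.apache.spark.sql.SaveMode;

// Append: add the new rows to the already-existing tab1
df.write().mode(SaveMode.Append).insertInto("tab1");

// Ignore: silently do nothing, because tab1 already exists
df.write().mode(SaveMode.Ignore).saveAsTable("tab1");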
Upvotes: 6
Reputation: 1584
You are using the saveAsTable API, which creates the table in Hive. Since you have already created the Hive table through the command above, the table tab1 already exists, so when the Spark API tries to create it, it throws an error saying the table already exists: org.apache.spark.sql.AnalysisException: Table `tab1` already exists.
Either drop the table and let the saveAsTable API create the table itself (a sketch of this option follows the code below), or use the insertInto API to insert into the existing Hive table:
df.repartition(numBuckets, somecol)
  .write()
  .format("parquet")
  .bucketBy(numBuckets, col1, col2)
  .sortBy(col1)
  .insertInto(hiveTableName);
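For the first option, a minimal sketch (assuming session is the SparkSession from the question; dropping the pre-created table first is my assumption about how you would clear the way for saveAsTable):

// Drop the pre-created Hive table so saveAsTable can create it itself
session.sql("DROP TABLE IF EXISTS tab1");

df.repartition(numBuckets, somecol)
  .write()
  .format("parquet")
  .bucketBy(numBuckets, col1, col2)
  .sortBy(col1)
  .saveAsTable(hiveTableName);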
Upvotes: 6