Kenny_I

Reputation: 2523

A schema mismatch detected when writing to the Delta table - Azure Databricks

I am trying to load "small_radio_json.json" into a Delta Lake table. After the code below I would create the table.

When I try to create the Delta table I get the error "A schema mismatch detected when writing to the Delta table." It may be related to the partitioning in `events.write.format("delta").mode("overwrite").partitionBy("artist").save("/delta/events/")`.

How can I fix or modify the code?

    //https://learn.microsoft.com/en-us/azure/azure-databricks/databricks-extract-load-sql-data-warehouse
    //https://learn.microsoft.com/en-us/azure/databricks/_static/notebooks/delta/quickstart-scala.html

    //Session configuration
    val appID = "123558b9-3525-4c62-8c48-d3d7e2c16a6a"
    val secret = "123[xEPjpOIBJtBS-W9B9Zsv7h9IF:qw"
    val tenantID = "12344839-0afa-4fae-a34a-326c42112bca"

    spark.conf.set("fs.azure.account.auth.type", "OAuth")
    spark.conf.set("fs.azure.account.oauth.provider.type",
      "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set("fs.azure.account.oauth2.client.id", appID)
    spark.conf.set("fs.azure.account.oauth2.client.secret", secret)
    spark.conf.set("fs.azure.account.oauth2.client.endpoint",
      "https://login.microsoftonline.com/" + tenantID + "/oauth2/token")
    spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")

    //Account Information
    val storageAccountName = "mydatalake"
    val fileSystemName = "fileshare1"

    spark.conf.set("fs.azure.account.auth.type." + storageAccountName + ".dfs.core.windows.net", "OAuth")
    spark.conf.set("fs.azure.account.oauth.provider.type." + storageAccountName + ".dfs.core.windows.net",
      "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set("fs.azure.account.oauth2.client.id." + storageAccountName + ".dfs.core.windows.net", appID)
    spark.conf.set("fs.azure.account.oauth2.client.secret." + storageAccountName + ".dfs.core.windows.net", secret)
    spark.conf.set("fs.azure.account.oauth2.client.endpoint." + storageAccountName + ".dfs.core.windows.net",
      "https://login.microsoftonline.com/" + tenantID + "/oauth2/token")
    spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
    dbutils.fs.ls("abfss://" + fileSystemName + "@" + storageAccountName + ".dfs.core.windows.net/")
    spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")

    dbutils.fs.cp("file:///tmp/small_radio_json.json",
      "abfss://" + fileSystemName + "@" + storageAccountName + ".dfs.core.windows.net/")

    val df = spark.read.json("abfss://" + fileSystemName + "@" + storageAccountName +
      ".dfs.core.windows.net/small_radio_json.json")

    //df.show()

    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._

    val events = df

    display(events)

    import org.apache.spark.sql.SaveMode

    events.write.format("delta").mode("overwrite").partitionBy("artist").save("/delta/events/")

    val events_delta = spark.read.format("delta").load("/delta/events/")
    display(events_delta)

The exception:

    org.apache.spark.sql.AnalysisException: A schema mismatch detected when writing to the Delta table.
    To enable schema migration, please set:
    '.option("mergeSchema", "true")'.

    Table schema:
    root
    -- action: string (nullable = true)
    -- date: string (nullable = true)


    Data schema:
    root
    -- artist: string (nullable = true)
    -- auth: string (nullable = true)
    -- firstName: string (nullable = true)
    -- gender: string (nullable = true)

Upvotes: 11

Views: 52810

Answers (2)

venus

Reputation: 1258

Most probably the /delta/events/ directory already contains data from a previous run, and that data may have a different schema than the current one. When you write new data to the same directory, you get this type of exception.
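If that is the case, one way to recover (a sketch, assuming the previously written data at /delta/events/ is disposable) is to delete the old directory and rewrite the table from scratch:

```scala
// WARNING: this permanently deletes whatever earlier runs wrote to
// /delta/events/. Only do this if that data is no longer needed.
dbutils.fs.rm("/delta/events/", recurse = true)

// Rewrite using the current DataFrame's schema; since the table is
// recreated from scratch, no schema mismatch can occur.
events.write.format("delta")
  .mode("overwrite")
  .partitionBy("artist")
  .save("/delta/events/")
```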

Upvotes: 4

Nikunj Kakadiya

Reputation: 3008

You are getting the schema mismatch error because the columns in your table are different from the columns in your dataframe.

As per the error snapshot you pasted in the question, your table schema has only two columns while your dataframe schema has four:

Table schema:
root
-- action: string (nullable = true)
-- date: string (nullable = true)


Data schema:
root
-- artist: string (nullable = true)
-- auth: string (nullable = true)
-- firstName: string (nullable = true)
-- gender: string (nullable = true)

Now you have two options:

  1. If you want to keep only the schema that is present in the dataframe, set the option overwriteSchema to true.
  2. If you want to keep all the columns, set the option mergeSchema to true. In this case the two schemas are merged and the table will have six columns, i.e. the two existing columns plus the four new columns from the dataframe.
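With the `events` DataFrame and `/delta/events/` path from the question, the two options would look roughly like this (a sketch; both `overwriteSchema` and `mergeSchema` are standard Delta Lake writer options):

```scala
// Option 1: replace the existing table schema with the DataFrame's schema.
// overwriteSchema only takes effect together with mode("overwrite"), and it
// also allows changing the partitioning of the table.
events.write.format("delta")
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .partitionBy("artist")
  .save("/delta/events/")

// Option 2: keep the existing columns and add the new ones.
// Rows get null in any column their side of the merge is missing.
// Note that mergeSchema cannot change the table's existing partitioning.
events.write.format("delta")
  .mode("overwrite")
  .option("mergeSchema", "true")
  .save("/delta/events/")
```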

Upvotes: 14
