Asfand Qazi

Reputation: 6875

Databricks: Ingesting CSV data to a Delta Live Table in Python triggers "invalid characters in table name" error - how to set column mapping mode?

First off, can I just say that I am learning Databricks at the time of writing this post, so I'd like simpler, cruder solutions as well as more sophisticated ones.

I am reading a CSV file like this:

df1 = spark.read.format("csv").option("header", True).load(path_to_csv_file)

Then I'm saving it as a Delta Live Table like this:

df1.write.format("delta").save("table_path")

The CSV headers contain characters like spaces, &, and /, and I get this error:

AnalysisException: Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema. Please enable column mapping by setting table property 'delta.columnMapping.mode' to 'name'. For more details, refer to https://docs.databricks.com/delta/delta-column-mapping.html Or you can use alias to rename it.

The documentation I've seen on the issue explains how to set the column mapping mode to 'name' AFTER a table has been created using ALTER TABLE, but does not explain how to set it at creation time, especially when using the DataFrame API as above. Is there a way to do this?

Is there a better way to get CSV into a new table?


UPDATE:

Reading the docs here and here, and inspired by Robert's answer, I tried this first:

spark.conf.set("spark.databricks.delta.defaults.columnMapping.mode", "name")

Still no luck; I get the same error. It's surprising how hard it is for a beginner to write a CSV file with spaces in its headers to a Delta Live Table.

Upvotes: 9

Views: 11689

Answers (3)

Jenn Fang

Reputation: 21

The following worked for me:

spark.conf.set("spark.databricks.delta.properties.defaults.columnMapping.mode","name")

In your original post, ".properties." was missing from the property name.
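
For context, a minimal sketch of how this might fit with the code from the question (path_to_csv_file and table_path are the placeholders from the original post, and this assumes the setting is applied before the table is written):

# Assumed: set the default column mapping mode for Delta tables created in this session
spark.conf.set("spark.databricks.delta.properties.defaults.columnMapping.mode", "name")

# Read the CSV and write it out as Delta, as in the original question
df1 = spark.read.format("csv").option("header", True).load(path_to_csv_file)
df1.write.format("delta").save("table_path")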

Upvotes: 2

Asfand Qazi

Reputation: 6875

Thanks to Hemant on the Databricks community forum, I have found the answer.

(df1.write.format("delta")
     .option("delta.columnMapping.mode", "name")
     .option("path", "table_path")
     .saveAsTable("new_table"))

Now I can either query it with SQL:

SELECT * FROM new_table;

or load it into a Spark DataFrame:

delta_df = spark.read.format("delta").load("table_path")
display(delta_df)

SQL Way

This method does the same thing but in SQL.

First, create a CSV-backed table for your CSV file:

CREATE TABLE table_csv
  USING CSV
  OPTIONS (path '/path/to/file.csv', 'header' 'true', 'mode' 'FAILFAST');

Then create a Delta table using the CSV-backed table:

CREATE TABLE delta_table
  USING DELTA
  TBLPROPERTIES ("delta.columnMapping.mode" = "name")
  AS SELECT * FROM table_csv;

SELECT * FROM delta_table;

I've verified that if I omit the TBLPROPERTIES clause, I get the same error as I did when using Python.

I guess the Python equivalent would be to run this SQL via spark.sql; that way I could embed the CSV path variable in the SQL, as sketched below.
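
If you go that route, a rough sketch might look like this (csv_path is a placeholder, and the f-string interpolation assumes the path comes from a trusted source):

# Assumed: a variable holding the CSV location
csv_path = "/path/to/file.csv"

# Create the CSV-backed staging table
spark.sql(f"""
    CREATE TABLE table_csv
      USING CSV
      OPTIONS (path '{csv_path}', 'header' 'true', 'mode' 'FAILFAST')
""")

# Create the Delta table with column mapping enabled at creation time
spark.sql("""
    CREATE TABLE delta_table
      USING DELTA
      TBLPROPERTIES ("delta.columnMapping.mode" = "name")
      AS SELECT * FROM table_csv
""")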

Upvotes: 12

Robert Kossendey

Reputation: 6998

You can set the option in the Spark configuration of the cluster you are using; that way it applies to everything running on the cluster.

You could also set the config at runtime like this:

spark.conf.set("spark.databricks.<name-of-property>", <value>)
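
For the column mapping case in this question, that would presumably be the property from Jenn Fang's answer above:

# Assumed property name, per the other answer in this thread
spark.conf.set("spark.databricks.delta.properties.defaults.columnMapping.mode", "name")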

Upvotes: 0
