Reputation: 6875
First off, I should say that I am learning Databricks at the time of writing this post, so I'd appreciate simpler, cruder solutions as well as more sophisticated ones.
I am reading a CSV file like this:
df1 = spark.read.format("csv").option("header", True).load(path_to_csv_file)
Then I'm saving it as a Delta Live Table like this:
df1.write.format("delta").save("table_path")
The CSV headers contain characters such as spaces, & and /, and I get the error:
AnalysisException: Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema. Please enable column mapping by setting table property 'delta.columnMapping.mode' to 'name'. For more details, refer to https://docs.databricks.com/delta/delta-column-mapping.html Or you can use alias to rename it.
The documentation I've seen on the issue explains how to set the column mapping mode to 'name' AFTER a table has been created, using ALTER TABLE (sketched below), but it does not explain how to set it at creation time, especially when using the DataFrame API as above. Is there a way to do this?
Is there a better way to get CSV into a new table?
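For reference, the after-the-fact approach from the column-mapping docs looks roughly like this when run from Python (the table name is illustrative, and the reader/writer versions are the minimums the docs mention):
# Enable column mapping on an existing Delta table, per the linked docs.
# "my_table" is just a placeholder name.
spark.sql("""
    ALTER TABLE my_table SET TBLPROPERTIES (
        'delta.minReaderVersion' = '2',
        'delta.minWriterVersion' = '5',
        'delta.columnMapping.mode' = 'name'
    )
""")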
UPDATE:
Reading the docs here and here, and inspired by Robert's answer, I tried this first:
spark.conf.set("spark.databricks.delta.defaults.columnMapping.mode", "name")
Still no luck; I get the same error. It's interesting how hard it is for a beginner to write a CSV file with spaces in its headers to a Delta Live Table.
Upvotes: 9
Views: 11689
Reputation: 21
The following worked for me:
spark.conf.set("spark.databricks.delta.properties.defaults.columnMapping.mode","name")
In your original post, ".properties." was missing from the property name.
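Put together with the code from the question, the flow would presumably look like this (the paths and variable names are the placeholders from the question):
# Make "name" column mapping the default for Delta tables created in this session
spark.conf.set("spark.databricks.delta.properties.defaults.columnMapping.mode", "name")

# The original read/write, unchanged, should then succeed despite the spaces, & and / in the headers
df1 = spark.read.format("csv").option("header", True).load(path_to_csv_file)
df1.write.format("delta").save("table_path")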
Upvotes: 2
Reputation: 6875
Thanks to Hemant on the Databricks community forum, I have found the answer.
df1.write.format("delta").option("delta.columnMapping.mode", "name")
.option("path", "table_path").saveAsTable("new_table")
Now I can either query it with SQL or load it into a Spark dataframe:
SELECT * FROM new_table;
delta_df = spark.read.format("delta").load("table_path")
display(delta_df)
The following method does the same thing, but in SQL.
First, create a CSV-backed table for your CSV file:
CREATE TABLE table_csv
USING CSV
OPTIONS (path '/path/to/file.csv', 'header' 'true', 'mode' 'FAILFAST');
Then create a Delta table using the CSV-backed table:
CREATE TABLE delta_table
USING DELTA
TBLPROPERTIES ("delta.columnMapping.mode" = "name")
AS SELECT * FROM table_csv;
SELECT * FROM delta_table;
I've verified that if I omit the TBLPROPERTIES clause, I get the same error as I did when using Python.
I guess the Python version of this answer would be to run the same statements with spark.sql; that way I could embed the CSV path variable in the SQL.
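A minimal sketch of that idea, with the path pulled from a Python variable (the names are just placeholders):
# Run the same SQL from Python so the CSV path can come from a variable
csv_path = "/path/to/file.csv"  # placeholder

spark.sql(f"""
    CREATE TABLE table_csv
    USING CSV
    OPTIONS (path '{csv_path}', 'header' 'true', 'mode' 'FAILFAST')
""")

spark.sql("""
    CREATE TABLE delta_table
    USING DELTA
    TBLPROPERTIES ("delta.columnMapping.mode" = "name")
    AS SELECT * FROM table_csv
""")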
Upvotes: 12
Reputation: 6998
You can set the option in the Spark configuration of the cluster you are using; that is how you enable the mode for the whole cluster. You could also set the config at runtime like this:
spark.conf.set("spark.databricks.<name-of-property>", <value>)
Upvotes: 0