Egorsky

Reputation: 333

Convert spark dataframe to DeltaLake in Databricks

I have a Spark DataFrame which is actually a huge Parquet file read from a container instance in Azure, and I want to convert it to Delta Lake format. But every time I try to do that, it throws an error without any message attached.

I want to save it either to Databricks itself or to the container instance (if possible).

I tried already df.write.format("delta").save("f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/my_data)

and

df.write.format("delta").saveAsTable("f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/my_data)

and


CREATE DELTA TABLE lifetime_delta
USING parquet
OPTIONS (f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/")

I do think I need to create a table somehow. I've heard that since Parquet is the native storage format for Delta Lake, the data should already be usable in a Delta Lake context, but for some reason that doesn't seem to be the case.

None of it worked for me. Thank you in advance.

Upvotes: 1

Views: 2660

Answers (1)

Powers

Reputation: 19328

There are two main ways to convert Parquet files to a Delta Lake:

  • Read the Parquet files into a Spark DataFrame and write out the data as Delta files. Looks like this is what you're trying to do. Here's your code:
df.write.format("delta").save("f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/my_data)

You may need to change it as follows to match Python's f-string syntax (the f prefix goes before the opening quote, and the string needs a closing quote); a fuller end-to-end sketch is included at the end of this answer:

df.write.format("delta").save(f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/my_data")
  • You can also convert from Parquet to Delta Lake in place using the following code:
from delta import *

# Convert the existing Parquet files at the given path into a Delta table in place
deltaTable = DeltaTable.convertToDelta(spark, "parquet.`tmp/lake2`")
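If the Parquet files should stay where they are in ADLS, the same in-place conversion should also work when pointed at the abfss path from your question. This is only a sketch, assuming container and storage_account_name are defined as in your question and the cluster already has access to the storage account:

# Assumes container and storage_account_name are defined as in the question,
# and reuses the DeltaTable import from the snippet above
parquet_path = f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/my_data"

# Convert the Parquet files at that ADLS path into a Delta table in place
deltaTable = DeltaTable.convertToDelta(spark, f"parquet.`{parquet_path}`")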

Here's an example notebook with code snippets to perform this operation that you may find useful.
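For completeness, here's a rough end-to-end sketch of the first approach applied to the paths from your question. It assumes df, container, and storage_account_name are already defined as in your question, that the cluster can reach the storage account, and that lifetime_delta is just a placeholder table name:

# Assumes df, container, and storage_account_name are defined as in the question
delta_path = f"abfss://{container}@{storage_account_name}.dfs.core.windows.net/my_data"

# Write the DataFrame out as Delta files at the ADLS path
df.write.format("delta").mode("overwrite").save(delta_path)

# Optionally register the Delta location as an external table in the metastore,
# so it can be queried by name (saveAsTable expects a table name, not a path)
spark.sql(f"CREATE TABLE IF NOT EXISTS lifetime_delta USING DELTA LOCATION '{delta_path}'")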

Upvotes: 3
