MAhmad

Reputation: 5

Update database table with PySpark in Databricks

I have a table in an Azure SQL Server database that is populated from my DataFrame. I want to update this table based on multiple conditions from Databricks, using PySpark / pandas. Being new to PySpark / Databricks / pandas, can someone please advise how to update the table? I have already inserted the data into the table. One solution I could think of is to load the data from the table into a DataFrame, merge the new file into that DataFrame, then delete the data from the table and insert the DataFrame. If this is the right approach, how can I delete the data from the database table in this scenario?

Upvotes: 0

Views: 2343

Answers (1)

ASH

Reputation: 20302

As you stated: 'load the data from the table into a dataframe and then merge the new file into the same dataframe, then delete the data from table and insert this dataframe.' That's definitely one option. I don't know if it's the absolute best practice, but it should be pretty fast, and it's almost certainly the preferred way to do this, because the cluster runs in parallel, so data manipulation, calculations, and so on get done very quickly. Of course, you can also run SQL updates directly on the table, but if the table is really large (billions of records and dozens of columns), that will probably be very slow: the SQL updates will not run in parallel on the cluster, whereas Spark will.
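For concreteness, here's a minimal sketch of that load / merge / overwrite flow. The JDBC URL, credentials, table name, file path, and key column are all placeholders, and the merge rule (new rows win on key collisions) is just one assumption about what "merge" should mean here:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder connection details -- substitute your own server, database,
# and credentials (ideally pulled from a Databricks secret scope).
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"
connection_props = {
    "user": "<user>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# 1. Load the existing table into a DataFrame.
existing_df = spark.read.jdbc(url=jdbc_url, table="dbo.MyTable",
                              properties=connection_props)

# 2. Read the new file and merge it in. "key_col" and the file path are
#    hypothetical; "merge" here means rows from the new file replace
#    existing rows that share a key, and everything else is kept.
new_df = spark.read.csv("/mnt/new_data.csv", header=True, inferSchema=True)
merged_df = (
    existing_df.join(new_df, on="key_col", how="left_anti")  # rows NOT being replaced
    .unionByName(new_df)                                     # plus the new/updated rows
)

# 3. Overwrite the table with the merged result; this covers the
#    "delete then insert" step in a single write.
merged_df.write.jdbc(url=jdbc_url, table="dbo.MyTable", mode="overwrite",
                     properties=connection_props)
```

One detail worth knowing: `mode="overwrite"` drops and recreates the table by default; adding `.option("truncate", "true")` via the DataFrameWriter instead empties the table while keeping its schema, which is effectively the "delete the data" step you asked about.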

See the link below for some additional ideas of what can be done.

https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html
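If the merge is done as a join between two DataFrames that share column names, you'll hit the duplicated-column problem the linked article covers. A small sketch of one way to resolve it, reusing the hypothetical DataFrames and column names from the sketch above:

```python
from pyspark.sql import functions as F

# Alias both sides so the duplicated column names can be referenced
# unambiguously, then prefer the new value wherever one exists.
joined = existing_df.alias("old").join(new_df.alias("new"),
                                       on="key_col", how="left")
resolved = joined.select(
    "key_col",
    F.coalesce(F.col("new.value_col"), F.col("old.value_col")).alias("value_col"),
)
```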

Upvotes: 0
