Timothyw0

Reputation: 21

Spark SQL Update/Delete

Currently, I am working on a pySpark project that reads in a few Hive tables, stores them as DataFrames, and performs a few updates/filters on them. I am avoiding Spark-specific syntax at all costs, so that the framework only takes SQL from a parameter file and runs it through my pySpark framework.

Now the problem is that I have to perform UPDATE/DELETE queries on my final DataFrame. Are there any possible workarounds for performing these operations on my DataFrame?

Thank you so much!

Upvotes: 1

Views: 1808

Answers (1)

Cesar A. Mostacero

Reputation: 770

A DataFrame is immutable: you cannot change it, so you are not able to update or delete rows in place.

If you want to "delete", there is the .filter option (it creates a new DataFrame that excludes the records matching the condition you applied in the filter). If you want to "update", the closest equivalent is .map, where you can "modify" each record and the new values land in a new DataFrame; the catch is that the function will iterate over every record in the DataFrame.
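For example, here is a minimal PySpark sketch of both workarounds. The table and column names are hypothetical, and a column-wise .withColumn is used for the "update" as a simpler alternative to iterating every record with .rdd.map:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("df-workarounds").getOrCreate()

# Hypothetical source table; adjust the names to your data
df = spark.table("mydb.customers")

# "DELETE": .filter returns a NEW DataFrame without the matching rows
active_df = df.filter(F.col("status") != "inactive")

# "UPDATE": .withColumn returns a NEW DataFrame with the column rewritten
updated_df = active_df.withColumn(
    "status",
    F.when(F.col("balance") < 0, F.lit("overdue")).otherwise(F.col("status")),
)
```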

Another thing to keep in mind: if you load data into a DataFrame from some source (e.g. a Hive table) and perform some operations, the updated data won't be reflected in your source. DataFrames live in memory until you persist the data.
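For instance, a sketch of persisting the result back to Hive (the target table name is hypothetical; writing to a separate table avoids overwriting a table you are still reading from):

```python
# Nothing changes in Hive until the DataFrame is explicitly written back
updated_df.write.mode("overwrite").saveAsTable("mydb.customers_updated")
```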

So you cannot treat a DataFrame like a SQL table for those operations. Depending on your requirements, you need to evaluate whether Spark is the right solution for your specific problem.
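That said, since the framework in the question only accepts SQL, one possible workaround is to register the DataFrame as a temporary view and express UPDATE/DELETE as SELECT statements. A sketch, again with hypothetical table and column names:

```python
# Register the DataFrame so plain SQL from a parameter file can run against it
df.createOrReplaceTempView("customers")

# DELETE rewritten as a filtering SELECT
deleted = spark.sql("SELECT * FROM customers WHERE status <> 'inactive'")

# UPDATE rewritten as a SELECT with a CASE expression
updated = spark.sql("""
    SELECT id,
           CASE WHEN balance < 0 THEN 'overdue' ELSE status END AS status,
           balance
    FROM customers
""")
```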

Upvotes: 4
