Reputation: 115
I have a requirement to load/delete specific records from a Postgres DB for my Spark application. For loading, I am using a Spark DataFrame in the format below:
sqlContext.read.format("jdbc").options(Map("url" -> "postgres url",
"user" -> "user" ,
"password" -> "xxxxxx" ,
"table" -> "(select * from employee where emp_id > 1000) as filtered_emp")).load()
To delete the data, I am writing direct SQL instead of using DataFrames:
delete from employee where emp_id > 1000
The question is: is there a Spark way of deleting records in the database, something similar to the below, or is the only way to use direct SQL?
sqlContext.read.format("jdbc").options(Map("url" -> "postgres url",
"user" -> "user" ,
"password" -> "xxxxxx" ,
"table" -> "(delete from employee where emp_id > 1000) as filtered_emp")).load()
Upvotes: 4
Views: 12486
Reputation: 311
I have experienced something similar and used Java JDBC libraries in Scala to delete rows from the table. You can see the solution here: Delete rows from Azure Sql table using Azure Databricks with Scala
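The core of it looks roughly like this (a minimal sketch; the JDBC URL and credentials are placeholders you would replace with your own):

import java.sql.DriverManager

// open a plain JDBC connection from the driver and run the delete directly
val conn = DriverManager.getConnection("jdbc:postgresql://host:5432/db", "user", "xxxxxx")
try {
  val deleted = conn.createStatement()
    .executeUpdate("delete from employee where emp_id > 1000")
  println(s"deleted $deleted rows")
} finally {
  conn.close()
}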
Upvotes: 0
Reputation: 704
If you want to modify (delete records in) the actual source of the data, i.e. tables in Postgres, then Spark isn't a great fit. You can use a JDBC client directly to achieve the same.
If you want to do this anyway (in a distributed manner, based on keys you are computing as part of your DataFrames), you can write the same JDBC client code against a DataFrame that carries the keys of the records to delete, and have it executed on multiple workers in parallel, e.g. with a foreachPartition as sketched below.
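A minimal sketch of that pattern, assuming a DataFrame df holds the emp_ids to delete as a bigint column (the connection details are placeholders):

import java.sql.DriverManager

// each worker opens its own connection and deletes the rows in its partition
df.select("emp_id").foreachPartition { rows =>
  val conn = DriverManager.getConnection("postgres url", "user", "xxxxxx")
  val stmt = conn.prepareStatement("delete from employee where emp_id = ?")
  try {
    rows.foreach { row =>
      stmt.setLong(1, row.getLong(0))
      stmt.addBatch()
    }
    stmt.executeBatch() // run this partition's deletes in one round trip
  } finally {
    stmt.close()
    conn.close()
  }
}

Batching per partition keeps the number of round trips to Postgres down; one connection per partition avoids sharing a non-serializable connection across workers.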
Upvotes: 3