sree
sree

Reputation: 1960

How to stream datasets from a table?

I have to use spark to load data from a SQL server table to csv in local file system. Below is the code I used.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val jdbcSqlConnStr = "jdbc:sqlserver://HostIP:1433;databaseName=DBName;user=UserName;password=PWD;"
val jdbcDbTable = "dbo.Table"
val jdbcDF = sqlContext.read.format("jdbc").option("url",jdbcSqlConnStr).option("dbtable",jdbcDbTable).load()
jdbcDF.write.format("com.databricks.spark.csv").save("D:\\Spark\\2")

This is working fine. I need to try any other better aproach to further faster the above approach.

Is there any approach that I could stream the table directly from sql server and write to filesystem instead of storing for a while in-memory. Does this performs better than above? If yes, could someone please provide me details for the same

Upvotes: 3

Views: 2272

Answers (1)

Jacek Laskowski
Jacek Laskowski

Reputation: 74619

Is there any approach that I could stream the table directly from sql server and write to filesystem instead of storing for a while in-memory.

The feature is not directly supported by the current version of Apache Spark 2.2.0 (and requires some coding).

You could use Spark Structured Streaming (that's a module to execute structured queries) and write a custom streaming Source that would do the new row fetching and creating a DataFrame for processing.

The main question with a streaming source is how to track what changed between (micro-batch) executions. If you execute select * from table and the table has changed in-between, you simply get different results. You'd have to somehow change the (batch) query to get proper results (say delta between states of the table, i.e. what rows were back then and are now). That sounds like a Change data capture, doesn't it? That's where I'd look at, but that's something that is not supported out of the box in Apache Spark.

That would pretty much require to change your code to use readStream (not read) and the custom format (that uses CDC or some other "tools").

Upvotes: 2

Related Questions