Puneet Garg

Reputation: 41

How to do Spark streaming on Hive tables using SQL in NON-real time?

We have data (millions of rows) in Hive tables that arrives every day. The next day, once the overnight ingestion is complete, different applications query us for the data (using SQL).

We take this SQL and run it on Spark:

spark.sqlContext.sql(statement)  // hive-metastore integration is enabled

This causes too much memory usage on the Spark driver. Can we use Spark Streaming (or Structured Streaming) to stream the results in a piped fashion, rather than collecting everything on the driver and then sending it to clients?

We don't want to send out the data as soon as it arrives (as in typical streaming apps); we want to send streaming data to clients when they ask (PULL) for it.
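
For context, a minimal sketch of the flow described above (the table name, query text and the sendToClient helper are hypothetical stand-ins); toLocalIterator() is included only as one non-streaming example of the "piped" behaviour being asked about:

    import scala.collection.JavaConverters._

    import org.apache.spark.sql.{Row, SparkSession}

    object QueryServingSketch {

      // Hypothetical stand-in for whatever actually ships rows to a client.
      def sendToClient(rows: Iterator[Row]): Unit = rows.foreach(println)

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("query-serving-sketch")
          .enableHiveSupport()        // hive-metastore integration is enabled
          .getOrCreate()

        // Assumed example of a client-supplied statement.
        val statement = "SELECT * FROM warehouse.daily_events WHERE event_date = '2021-01-15'"

        // Current pattern: the whole result is materialised on the driver
        // before it is sent out, which is what drives up driver memory.
        val allRows = spark.sqlContext.sql(statement).collect()
        sendToClient(allRows.iterator)

        // One "piped" alternative without streaming: toLocalIterator() pulls
        // one partition at a time to the driver, so only a slice of the
        // result is held in driver memory while it is forwarded.
        val pipedRows = spark.sqlContext.sql(statement).toLocalIterator().asScala
        sendToClient(pipedRows)

        spark.stop()
      }
    }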

Upvotes: 0

Views: 1412

Answers (1)

Feroz

Reputation: 161

IIUC..

  • Spark Streaming is mainly designed to process streaming data by converting it into micro-batches on the order of milliseconds to seconds.

  • You can look at streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) => ... }, which gives you a very good way to have Spark write the processed streaming output to a sink in a micro-batch manner (a sketch follows the ForeachBatch quote below).

  • Nevertheless, Spark Structured Streaming doesn't have a standard JDBC source defined to read from.

  • Consider an option to store the Hive table's underlying files directly in a compressed and structured format and transfer them as-is (rather than selecting through spark.sql) if every client needs the same or similar data; or partition them based on the WHERE condition of the spark.sql query and transfer only the files that are needed (see the sketch right after this list).
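
A minimal sketch of that last suggestion, assuming a Hive table warehouse.daily_events and partition columns client_id/event_date (all hypothetical names); the idea is to write the data out once, compressed and partitioned by the columns clients filter on, and then hand out the matching files instead of re-running spark.sql per client:

    import org.apache.spark.sql.SparkSession

    object PartitionedExportSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("partitioned-export-sketch")
          .enableHiveSupport()        // hive-metastore integration, as in the question
          .getOrCreate()

        // Export the Hive table once, compressed (Snappy + Parquet) and
        // partitioned by the columns that usually appear in client WHERE clauses.
        spark.table("warehouse.daily_events")
          .write
          .mode("overwrite")
          .option("compression", "snappy")
          .partitionBy("client_id", "event_date")
          .parquet("/data/exports/daily_events")

        // A request like `WHERE client_id = 'acme' AND event_date = '2021-01-15'`
        // can then be served the files under
        //   /data/exports/daily_events/client_id=acme/event_date=2021-01-15/
        // directly, without pulling the rows through the Spark driver.

        spark.stop()
      }
    }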

Source:

Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees.

ForeachBatch:

foreachBatch(...) allows you to specify a function that is executed on the output data of every micro-batch of a streaming query. Since Spark 2.4, this is supported in Scala, Java and Python. It takes two parameters: a DataFrame or Dataset that has the output data of a micro-batch and the unique ID of the micro-batch.
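
To make the quoted API concrete, here is a minimal foreachBatch sketch (the source path, schema, checkpoint location and output path are all assumed for illustration); each micro-batch arrives as an ordinary DataFrame that can be written to a compressed, partitioned sink which clients then read directly:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.types.{LongType, StringType, StructType}

    object ForeachBatchSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("foreachbatch-sketch")
          .enableHiveSupport()
          .getOrCreate()

        // Assumed schema of the files landing in the ingestion directory.
        val schema = new StructType()
          .add("id", LongType)
          .add("event_date", StringType)
          .add("payload", StringType)

        // Assumed landing path for the nightly ingestion files.
        val streamingDF = spark.readStream
          .schema(schema)
          .parquet("/data/ingest/daily_events")

        val query = streamingDF.writeStream
          .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
            // batchDF holds the output of one micro-batch; batchId is its unique ID.
            batchDF.write
              .mode("append")
              .partitionBy("event_date")
              .parquet("/data/serving/daily_events")
          }
          .option("checkpointLocation", "/data/checkpoints/daily_events")
          .start()

        query.awaitTermination()
      }
    }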

Upvotes: 1
