Reputation: 241
I have a Structured Streaming query which sinks to Kafka. The query has complex aggregation logic.
I would like to sink the output DF of this query to multiple Kafka topics, each partitioned on a different 'key' column. I don't want to have a separate Kafka sink for each topic, because that would mean running multiple streaming queries (one per Kafka topic), which is especially costly given that my aggregation logic is complex.
Questions:
1. Is there a way to output the results of a Structured Streaming query to multiple Kafka topics, each with a different key column, but without having to execute multiple streaming queries?
2. If not, would it be efficient to cascade the queries so that the first query does the complex aggregation and writes its output to Kafka, and the other queries just read that output and write to their respective topics, thus avoiding the complex aggregation again?
Thanks in advance for any help.
Upvotes: 5
Views: 4012
Reputation: 323
I am also looking for a solution to this problem, and in my case it's not necessarily a Kafka sink. I want to write some records of a dataframe to sink1 and some other records to sink2 (depending on some condition), without reading the same data twice in two streaming queries. Currently this does not seem possible with the current implementation (the createSink() method in DataSource.scala only supports a single sink).
However, in Spark 2.4.0 there is a new API coming: foreachBatch(), which gives you a handle to the dataframe of each micro-batch, so you can cache it, write it to different sinks, or process it multiple times before uncaching it again. Something like this:
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.cache()                      // cache the micro-batch so it is computed only once
  batchDF.write.format(...).save(...)  // location 1
  batchDF.write.format(...).save(...)  // location 2
  batchDF.unpersist()                  // release the cached micro-batch
}
Right now this feature is available in the Databricks runtime: https://docs.databricks.com/spark/latest/structured-streaming/foreach.html#reuse-existing-batch-data-sources-with-foreachbatch
EDIT 15/Nov/18: It is now available in Spark 2.4.0 (https://issues.apache.org/jira/browse/SPARK-24565)
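Applied to the split-by-condition case described above, a rough sketch could look like this (the "status" column, the filter predicates and the output paths are made-up placeholders, and streamingDF stands for your input stream):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.persist()                                // read/compute the micro-batch only once
  batchDF.filter(col("status") === "ok")           // records for sink 1
    .write.format("parquet").save("/data/out/ok")
  batchDF.filter(col("status") =!= "ok")           // records for sink 2
    .write.format("parquet").save("/data/out/rest")
  batchDF.unpersist()
  ()
}.start()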
Upvotes: 6
Reputation: 241
So the answer was kind of staring me in the face, and it's documented as well; link below.
You can write to multiple Kafka topics from a single query. If the dataframe you want to write has a column named "topic" (along with "key" and "value" columns), each row is written to the topic named in that row's "topic" column. This works automatically, so the only thing you need to figure out is how to generate the value of that column.
This is documented - https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#writing-data-to-kafka
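To make this concrete, here is a minimal sketch of the approach; aggregatedDF stands for the output of your aggregation, and the column names ("region", "userId", "orderId"), topic names, broker address and checkpoint location are just illustrative placeholders:

import org.apache.spark.sql.functions._

// Derive "topic", "key" and "value" per row; the Kafka sink then routes each
// row to the topic named in its "topic" column, keyed by its "key" column.
val routed = aggregatedDF.select(
  when(col("region") === "EU", lit("events-eu")).otherwise(lit("events-us")).as("topic"),
  when(col("region") === "EU", col("userId").cast("string"))
    .otherwise(col("orderId").cast("string")).as("key"),    // a different key column per topic
  to_json(struct("*")).as("value")
)

routed.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")        // placeholder
  .option("checkpointLocation", "/path/to/checkpoint")      // placeholder
  .start()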
Upvotes: 10
Reputation: 4750
There is no way to have a single read and multiple writes in Structured Streaming out of the box. The only way is to implement a custom sink that writes into multiple topics.
Whenever you call dataset.writeStream().start(), Spark starts a new streaming query that reads from a source (readStream()) and writes into a sink (writeStream()).
Even if you try to cascade it, Spark will create two separate streams with one source and one sink each. In other words, it will read, process and write the data twice:
Dataset<Row> df = <aggregation>;
StreamingQuery sq1 = df.writeStream()...start();  // first query: reads, aggregates and writes
StreamingQuery sq2 = df.writeStream()...start();  // second query: reads and aggregates again from scratch
There is a way to cache the data that has been read in (DStream-based) Spark Streaming, but this option is not available for Structured Streaming yet.
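For completeness, a rough sketch of the custom-sink idea using foreach() with a ForeachWriter; the routing logic, column names and broker address are assumptions, and aggregatedDF stands for the output of the aggregation:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{ForeachWriter, Row}

class MultiTopicWriter extends ForeachWriter[Row] {
  @transient private var producer: KafkaProducer[String, String] = _

  override def open(partitionId: Long, epochId: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")  // placeholder
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true
  }

  override def process(row: Row): Unit = {
    // Hypothetical routing: pick topic and key per row.
    val topic = if (row.getAs[String]("region") == "EU") "events-eu" else "events-us"
    producer.send(new ProducerRecord(topic, row.getAs[String]("id"), row.mkString(",")))
  }

  override def close(errorOrNull: Throwable): Unit = producer.close()
}

aggregatedDF.writeStream.foreach(new MultiTopicWriter).start()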
Upvotes: 2