Reputation: 673
Currently we have Spark Structured Streaming.
In the Arrow docs, I found Arrow streaming, where we can create a stream in Python, produce the data, and use a StreamReader
to consume the stream in Java/Scala.
I am wondering whether there is an integration of the two, where we could produce an Arrow stream in Python and use Spark Structured Streaming to consume it (in a distributed manner)?
Imagine a scenario: one wants to build an easy-to-use Python API, but the computing engine is on Java/Scala. Using Kafka/Redis would not solve the problem of data types across the languages, but with Arrow there is currently no cluster support to access the data.
Upvotes: 3
Views: 479
Reputation: 14891
Perhaps not exactly what you're looking for, but Spark 3.3 will have a mapInArrow
API call - https://github.com/apache/spark/pull/34505
This will not work with streaming though.
Upvotes: 1
Reputation: 74669
I have never heard of a project like this. What you describe is pretty much PySpark Structured Streaming, where you have a running Python application on one side talking to the Spark infrastructure running on the JVM.
Upvotes: 0