Wassinger

Reputation: 397

How to write from a PySpark DStream to Redis?

I am using PySpark 2.3.1 to read a stream of values from Kafka as DStreams. I want to do some transforms on this data, such as taking a moving average, and save it to Redis. My Spark job code looks roughly like this:

import json

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

batch_duration = 1

# Initialize session
spark_session = SparkSession \
    .builder \
    .appName("my-app") \
    .getOrCreate()

spark_context = spark_session.sparkContext

# Create streaming context (wraps the SparkContext with a batch interval)
streaming_context = StreamingContext(spark_context, batch_duration)

# Read from Kafka
input_stream = KafkaUtils \
    .createDirectStream(streaming_context, ['price'], {"metadata.broker.list": kafka_urls})

I can then transform it with lines like:

jsons = input_stream.window(5000).map(lambda t: t[1]).map(json.loads)
prices = jsons.map(lambda d: d['price'])
total = prices.reduce(lambda x, y: x + y)
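
For the moving average itself I'm planning something like a windowed sum and count (just a sketch on my part; it assumes the window length is a multiple of batch_duration, which PySpark requires, and the 10-second length is arbitrary):

window_length = 10  # seconds; must be a multiple of batch_duration

# Pair each price with a count of 1, reduce over the window to (sum, count),
# then divide to get the average per window.
pairs = prices.map(lambda p: (p, 1.0))
sums_and_counts = pairs.window(window_length, batch_duration) \
    .reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
moving_avg = sums_and_counts.map(lambda t: t[0] / t[1])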

However, total in this case is still a DStream, and the spark-redis documentation says that only DataFrames can be written from PySpark. Fortunately, a DStream produces a periodic RDD for each batch as it runs, so I need to figure out how to convert those RDDs to DataFrames.

I tried

total.foreachRDD(lambda rdd:
                 rdd.toDF().write.format("org.apache.spark.sql.redis") \
                 .option("table", "people") \
                 .option("key.column", "name") \
                 .save())

Admittedly this was copied and pasted blindly from elsewhere on the net, so the option calls almost certainly don't match my data schema. I was hoping to decipher the exceptions and figure out where to go next. Unfortunately, running this on my Spark cluster prints so many lines of Java stack traces that the original Python exception scrolls out of my console history, and I can't tell what is causing the problem.
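
(One thing that helps a little with the log noise is turning Spark's logging down to errors only, so the Python traceback isn't buried; setLogLevel is a standard SparkContext method:)

# Suppress Spark's INFO/WARN output so the Python traceback stays visible
spark_context.setLogLevel("ERROR")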

Upvotes: 1

Views: 2829

Answers (1)

fe2s

Reputation: 435

Here is a word count example that saves its result to Redis:

import sys

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def save_rdd(rdd):
    # Each micro-batch arrives as an RDD; skip empty ones so toDF()
    # doesn't fail trying to infer a schema from zero rows.
    if not rdd.isEmpty():
        df = rdd.toDF()
        df.show()
        # Each row becomes a Redis hash keyed by the "_1" column (the word)
        df.write \
            .format("org.apache.spark.sql.redis") \
            .option("table", "counts") \
            .option("key.column", "_1") \
            .save(mode='append')

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("Example") \
        .getOrCreate()

    sc = spark.sparkContext
    sc.setLogLevel("ERROR")
    ssc = StreamingContext(sc, 2)  # 2-second batch interval

    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)

    counts.foreachRDD(save_rdd)

    ssc.start()
    ssc.awaitTermination()

Submit command:

./bin/spark-submit --master spark://Oleksiis-MacBook-Pro.local:7077 --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0,com.redislabs:spark-redis:2.4.0 ~/Projects/spark-redis-test/src/main/scala/com/redislabs/provider/test/spark-direct-kafka.py localhost:9092 new_topic

Please note that I included the com.redislabs:spark-redis:2.4.0 package.
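
By default spark-redis connects to localhost:6379. If your Redis instance runs elsewhere, the connection can be set via Spark config when building the session; the spark.redis.* keys below come from the spark-redis documentation, and the host value is just a placeholder:

# "my-redis-host" is a placeholder; spark.redis.auth can also be set if needed
spark = SparkSession \
    .builder \
    .appName("Example") \
    .config("spark.redis.host", "my-redis-host") \
    .config("spark.redis.port", "6379") \
    .getOrCreate()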

Write some words to new_topic:

./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic new_topic

>a b c a
>a b b

The output should appear in Redis as hashes where the key corresponds to the input word:

./redis-cli

127.0.0.1:6379> keys counts:*

1) "counts:a"
2) "counts:b"
3) "counts:c"

127.0.0.1:6379> hgetall counts:a
1) "_2"
2) "2"

If you'd like to save the DataFrame with meaningful column names rather than _1, _2, etc., you can rename the columns like this:

from pyspark.sql.functions import col

def save_rdd(rdd):
    if not rdd.isEmpty():
        # Rename the positional tuple columns before writing
        df = rdd.toDF().select(col("_1").alias("word"), col("_2").alias("count"))
        df.show()
        df.write \
            .format("org.apache.spark.sql.redis") \
            .option("table", "counts") \
            .option("key.column", "word") \
            .save(mode='append')

Note that we now set the key.column parameter to word.

Now the field name in Redis is "count":

127.0.0.1:6379> hgetall counts:abc
1) "count"
2) "1"

Hope it helps!

Upvotes: 1
