alexanoid

Reputation: 25802

Spark batch write to Kafka topic from multi-column DataFrame

After a batch Spark ETL, I need to write the resulting DataFrame, which contains multiple different columns, to a Kafka topic.

According to the Spark documentation https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html, the DataFrame being written to Kafka must have the following mandatory column in its schema:

value (required) string or binary

As I mentioned, I have many more columns with values, so I have a question - how do I properly send the whole DataFrame row as a single message to a Kafka topic from my Spark application? Do I need to join all of the values from all columns into a new DataFrame with a single value column (that will contain the joined value), or is there a more proper way to achieve it?

Upvotes: 0

Views: 1083

Answers (1)

user10696091

Reputation: 56

The proper way to do that is already hinted at by the docs, and doesn't really differ from what you'd do with any Kafka client - you have to serialize the payload before sending it to Kafka.

How you'll do that (to_json, to_csv, Apache Avro) depends on your business requirements - nobody can answer this but you (or your team).
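For example, with to_json you can pack every column of a row into a single JSON string and expose it as the mandatory value column. A minimal sketch in Scala (the column names, topic and broker address below are placeholders, not taken from your question):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, struct, to_json}

    object KafkaBatchWrite {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("batch-write-to-kafka")
          .getOrCreate()
        import spark.implicits._

        // Hypothetical multi-column result of the batch ETL
        val etlResult = Seq((1, "alice", 42.0), (2, "bob", 17.5))
          .toDF("id", "name", "score")

        // Serialize each whole row into one JSON string and name it "value",
        // the column the Kafka sink requires (string or binary).
        val toKafka = etlResult
          .select(to_json(struct(etlResult.columns.map(col): _*)).alias("value"))

        // Batch write to the topic (assumed broker address and topic name)
        toKafka.write
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("topic", "etl-results")
          .save()

        spark.stop()
      }
    }

The same pattern works for other formats - just replace the to_json(struct(...)) expression with whatever serialization your consumers expect.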

Upvotes: 4
