Opinion: Querying databases from Spark streaming or Structured streaming tasks

Question

We have a Spark streaming use case where we need to compute some metrics from ingested events (in Kafka), but the computations require additional metadata which are not present in the events.

The obvious design pattern I can think of is to make point queries to the metadata tables (on the master DB) from spark executor tasks and use that metadata info during the processing of each event.

Another idea would be to "enrich" the ingested events in a separate pipeline as a preprocessor step before sending them to Kafka. This could be done, say by another service or task.

The second scenario is more useful in cases when the domain/environment where Spark/hadoop runs is isolated from the domain of the master DB where all metadata is stored.

Is there a general consensus on how this type of event "enrichment" should be done? What other considerations am I missing here?

Opinion: Querying databases from Spark streaming or Structured streaming tasks

Answers (1)

Related Questions