Reputation: 242
Spark 2.1.1 (Scala API), streaming JSON files from an S3 location.
I want to deduplicate any incoming records based on an ID column ("event_id") found in the JSON of every record. I do not care which record is kept, even if duplication of the record is only partial. I am using append mode, as the data is merely being enriched/filtered via the spark.sql() method, with no group by/window aggregations. I then use append mode to write parquet files to S3.
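For context, the pipeline looks roughly like this. Everything below (the schema, paths, and enrichment query) is a hypothetical sketch, not the exact job:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("enrich").getOrCreate()

// Placeholder schema and paths for illustration only.
val schema = new StructType()
  .add("event_id", StringType)
  .add("event_time", TimestampType)
  .add("payload", StringType)

val raw = spark.readStream.schema(schema).json("s3a://bucket/incoming/")
raw.createOrReplaceTempView("events_raw")

// Enrichment/filtering only -- no group by or window aggregations.
val enriched = spark.sql(
  "SELECT event_id, event_time, payload FROM events_raw WHERE payload IS NOT NULL")

enriched.writeStream
  .outputMode("append")
  .format("parquet")
  .option("path", "s3a://bucket/enriched/")
  .option("checkpointLocation", "s3a://bucket/checkpoints/enriched/")
  .start()
```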
According to the documentation, I should be able to use dropDuplicates without watermarking in order to deduplicate (obviously this is not viable in a long-running production job). However, this fails with the error:
User class threw exception: org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets
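Building on the sketch above, the attempt that triggers the exception is essentially:

```scala
// Deduplicate with no watermark. In this Spark version, dropDuplicates
// on a streaming Dataset is planned as a stateful streaming aggregation,
// which is why append mode rejects it with the AnalysisException above.
val deduped = enriched.dropDuplicates("event_id")

deduped.writeStream
  .outputMode("append")
  .format("parquet")
  .option("path", "s3a://bucket/enriched/")
  .option("checkpointLocation", "s3a://bucket/checkpoints/enriched/")
  .start()
```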
That error seems odd, as I am doing no aggregation (unless dropDuplicates or spark.sql counts as an aggregation?).
I know that duplicates won't occur more than 3 days apart, so I then tried it again by adding a watermark (using .withWatermark() immediately before dropDuplicates). However, it seems to want to wait until the 3 days are up before writing the data (i.e. since today is July 24, only data up to the same time on July 21 is written to the output).
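The watermarked attempt looks like this ("event_time" is an assumed event-time column):

```scala
// In append mode, a watermarked streaming aggregation only emits a row
// once the watermark has passed it, so output trails the newest data
// by roughly the 3-day watermark -- matching the behaviour described.
val deduped = enriched
  .withWatermark("event_time", "3 days")
  .dropDuplicates("event_id")
```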
As there is no aggregation, I want to write every row immediately after the batch is processed, and simply throw away any rows with an event id that has occurred in the previous 3 days. Is there a simple way to accomplish this?
Thanks
Upvotes: 1
Views: 2479
Reputation: 242
The solution we used was a custom implementation of org.apache.spark.sql.execution.streaming.Sink that inserts into a Hive table after dropping duplicates within the batch and performing a left anti join against the previous few days' worth of data in the target Hive table. A sketch of the idea follows.
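A minimal sketch of the addBatch logic, assuming a target Hive table named events with an event_date column (both names, and the 3-day lookback expression, are assumptions):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.execution.streaming.Sink

class DedupHiveSink(spark: SparkSession) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Deduplicate within the micro-batch itself.
    val batchDeduped = data.dropDuplicates("event_id")

    // Left anti join against the last few days of the target table,
    // dropping any event_id that has already been written.
    val recentIds = spark.table("events")
      .where("event_date >= date_sub(current_date(), 3)")
      .select("event_id")
    val newRows = batchDeduped.join(recentIds, Seq("event_id"), "left_anti")

    // Append only the surviving rows to the Hive table.
    newRows.write.mode("append").insertInto("events")
  }
}
```

Wiring this in also requires a small org.apache.spark.sql.sources.StreamSinkProvider registered via .format(...), which is omitted here.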
Upvotes: 0
Reputation: 4719
In my case, I used to achieve that in two ways through DStream:
One way:
1. Load tmp_data (contains 3 days of unique data, see below).
2. leftOuterJoin the incoming batch with tmp_data.
3. filter on step 2 and output the new unique data.
4. Update tmp_data with the new unique data and save it on HDFS or whatever.
A rough sketch of these steps appears at the end of this answer.
Another way: use a sink that enforces a UNIQUE INDEX on event_id.
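A hypothetical sketch of the first approach, assuming events arrive keyed by event_id (the loader function and the types are assumptions):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream
import scala.reflect.ClassTag

def dedup[V: ClassTag](
    events: DStream[(String, V)],            // batches keyed by event_id
    loadTmpData: () => RDD[(String, Unit)]   // last 3 days of seen ids
): DStream[(String, V)] =
  events.transform { batch =>
    val tmpData = loadTmpData()                        // step 1: load tmp_data
    batch.leftOuterJoin(tmpData)                       // step 2: join with tmp_data
      .filter { case (_, (_, seen)) => seen.isEmpty }  // step 3: keep unseen ids only
      .map { case (id, (v, _)) => (id, v) }            // the new unique data
    // step 4 (not shown): union these ids into tmp_data, age out
    // anything older than 3 days, and write it back to HDFS
  }
```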
Upvotes: 1