Caca
Caca

Reputation: 31

Structured Streaming extract most recent values for each id

I have datastream containing ID, type, and value: For a group of users with given ID I receive measurements (values) from different sensors (type). Example of incoming data:

ID type value
1  A    70
2  B    16
1  A    71
2  A    72

I need to create Spark Structured Streaming app that will perform custom clustering of the obtained data. However, I am stuck at the begining> I don't know how to create a set of data that will contain the last measurements for each user for each type. I need to have this set for every user that has ever appeared in the system.

So, basically, for a data stream described above, I need a Structured Streaming app that will give me a set of last measurements for every user for every type>

  ID type value
  1  A    71
  2  B    16
  2  A    72

Users may be inactive for some time, I still need to keep their record. It would be useful if the output is a dataframe.

Any ideas for how to do this will be very welcome.

PS I am fairly new to Spark Structured Streaming, sorry if this is a trivial question.

Upvotes: 0

Views: 2105

Answers (1)

Ged
Ged

Reputation: 18013

The short answer is: this is not possible with Spark Structured Streaming (currently).

Many posts on this and none have suggested a solution that actually works.

When you think about it, in reality it is a tall order.

I tried various approaches - even though I knew it was not possible - and always got some sort of error from Spark. These are documented on Stack Overflow at length. E.g.:

Structured streaming custom deduplication

Retain last row for given key in spark structured streaming

Upvotes: 2

Related Questions