Reputation: 31
I have datastream containing ID, type, and value: For a group of users with given ID I receive measurements (values) from different sensors (type). Example of incoming data:
ID type value
1 A 70
2 B 16
1 A 71
2 A 72
I need to create Spark Structured Streaming app that will perform custom clustering of the obtained data. However, I am stuck at the begining> I don't know how to create a set of data that will contain the last measurements for each user for each type. I need to have this set for every user that has ever appeared in the system.
So, basically, for a data stream described above, I need a Structured Streaming app that will give me a set of last measurements for every user for every type>
ID type value
1 A 71
2 B 16
2 A 72
Users may be inactive for some time, I still need to keep their record. It would be useful if the output is a dataframe.
Any ideas for how to do this will be very welcome.
PS I am fairly new to Spark Structured Streaming, sorry if this is a trivial question.
Upvotes: 0
Views: 2105
Reputation: 18013
The short answer is: this is not possible with Spark Structured Streaming (currently).
Many posts on this and none have suggested a solution that actually works.
When you think about it, in reality it is a tall order.
I tried various approaches - even though I knew it was not possible - and always got some sort of error from Spark. These are documented on Stack Overflow at length. E.g.:
Structured streaming custom deduplication
Retain last row for given key in spark structured streaming
Upvotes: 2