torblerone

Reputation: 183

CombineFn for Python dict in Apache Beam pipeline

I've been experimenting with the Apache Beam SDK in Python to write data processing pipelines.

My data mocks IoT sensor data from a Google PubSub topic that streams JSON data like this:

{"id": 1, "temperature": 12.34}
{"id": 2, "temperature": 76.54}

There are IDs ranging from 0 to 99. Reading the JSON into a Python dict is no problem.
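For reference, a minimal sketch of the parsing step (the actual pipeline reads from Pub/Sub; the helper name here is just for illustration):

```python
import json

# Each message arrives as a JSON bytestring; json.loads turns it
# into a Python dict. This step already works in my pipeline.
def parse_message(message: bytes) -> dict:
    return json.loads(message.decode("utf-8"))

reading = parse_message(b'{"id": 1, "temperature": 12.34}')
# reading == {'id': 1, 'temperature': 12.34}
```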

I created a custom CombineFn to process by CombinePerKey. I hoped that the output of my accumulator would be the calculations, grouped by the respective id fields from the dictionaries in the PCollection.

However, when the add_input method is called, it only receives the string 'temperature' instead of the whole dictionary. I also did not find any reference explaining how to tell CombinePerKey which key (the id field in my case) it should group the data by.

Maybe I've also misunderstood the concepts of CombinePerKey and CombineFn. I'd appreciate any help or hints on this. Does anyone have an example of processing JSON batches with ID-based grouping? Do I have to convert the dictionary into something else?

Upvotes: 0

Views: 1621

Answers (1)

CaptainNabla

Reputation: 1166

You need to either adjust your CombineFn or (what I would recommend) keep the CombineFn as generic as possible and map the input of CombinePerKey accordingly. I have made short examples of both cases below, based on this official Beam example.

Specific CombineFn:

import apache_beam as beam

class SpecificAverageFn(beam.CombineFn):
  def create_accumulator(self):
    sum = 0.0
    count = 0
    accumulator = sum, count
    return accumulator

  def add_input(self, accumulator, input):
    sum, count = accumulator
    extracted_input = input['temperature'] # <- this is a dict, you need to create custom code here
    return sum + extracted_input, count + 1

  def merge_accumulators(self, accumulators):
    # accumulators = [(sum1, count1), (sum2, count2), (sum3, count3), ...]
    sums, counts = zip(*accumulators)
    # sums = [sum1, sum2, sum3, ...]
    # counts = [count1, count2, count3, ...]
    return sum(sums), sum(counts)

  def extract_output(self, accumulator):
    sum, count = accumulator
    if count == 0:
      return float('NaN')
    return sum / count

with beam.Pipeline() as pipeline:
  (
    pipeline
    | "mock input" >> beam.Create([
     {'id': 1, 'temperature': 2},
     {'id': 2, 'temperature': 3},
     {'id': 2, 'temperature': 2}
    ])
    | "add key" >> beam.Map(lambda x: (x['id'], x))
    | beam.CombinePerKey(SpecificAverageFn())
    | beam.Map(print)
  )

Generic CombineFn:

import apache_beam as beam

class GenericAverageFn(beam.CombineFn):
  # everything as SpecificAverageFn, except add_input:
  def add_input(self, accumulator, input):
    sum, count = accumulator
    return sum + input, count + 1


with beam.Pipeline() as pipeline:
  iot_data = (
    pipeline
    | "mock input" >> beam.Create([
     {'id': 1, 'temperature': 2},
     {'id': 2, 'temperature': 3},
     {'id': 2, 'temperature': 2}
    ])
    | "add key" >> beam.Map(lambda x: (x['id'], x))
  )

  # repeat below for other values
  (
    iot_data
    | "extract temp" >> beam.Map(lambda x: (x[0], x[1]['temperature']))
    | beam.CombinePerKey(GenericAverageFn())
    | beam.Map(print)
  )

Both approaches return

(1, 2.0)
(2, 2.5)
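As a plain-Python sanity check (no Beam involved), computing the per-key means of the mock data by hand gives the same numbers:

```python
from collections import defaultdict

# The same mock readings as in the pipelines above.
readings = [
    {'id': 1, 'temperature': 2},
    {'id': 2, 'temperature': 3},
    {'id': 2, 'temperature': 2},
]

# Accumulate (sum, count) per id, mirroring what the CombineFn does.
sums = defaultdict(lambda: (0.0, 0))
for r in readings:
    s, c = sums[r['id']]
    sums[r['id']] = (s + r['temperature'], c + 1)

averages = {key: s / c for key, (s, c) in sums.items()}
# averages == {1: 2.0, 2: 2.5}
```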

Upvotes: 1
