Reputation: 5914
I am running my own gRPC server that collects events from various data sources. The server is written in Go, and all the event sources send events in a predefined format as protobuf messages.
What I want to do is to process all these events with Apache Beam in memory.
I looked through the Apache Beam docs and couldn't find a sample that does what I want. I'm not going to use Kafka, Flink, or any other streaming platform; I just want to process the messages in memory and output the results.
Can someone point me in the right direction for writing a simple stream-processing app?
Upvotes: 0
Views: 1033
Reputation: 1443
Ok, first of all, Apache Beam is not a data processing engine; it's an SDK that allows you to create a unified pipeline and run it on different engines, like Spark, Flink, Google Dataflow, etc. So, to run a Beam pipeline you would need to use one of the supported data processing engines, or use the DirectRunner, which will run your pipeline locally but (!) has many limitations and was mostly developed for testing purposes.
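For illustration, here is roughly what a minimal pipeline on the direct runner looks like with the Beam Go SDK (import paths assume Beam's v2 Go module; the string elements are just placeholders for your events):

```go
package main

import (
	"context"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/runners/direct"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/x/debug"
)

func main() {
	// Initialize the Beam SDK; must be called once before building pipelines.
	beam.Init()

	p := beam.NewPipeline()
	s := p.Root()

	// A bounded, in-memory source for demonstration purposes.
	events := beam.Create(s, "event-1", "event-2", "event-3")

	// Print each element; a real pipeline would apply transforms here.
	debug.Print(s, events)

	// Execute the pipeline locally on the direct runner.
	if _, err := direct.Execute(context.Background(), p); err != nil {
		panic(err)
	}
}
```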
Like every Beam pipeline, yours has to have a source transform (bounded or unbounded) which will read data from your data source. I'd guess that in your case it will be your gRPC server, which would retransmit the collected events. For the source transform, you can either use one of the already implemented Beam IO transforms (IO connectors) or create your own, since there is no GrpcIO or anything similar in Beam for now; one possible workaround is sketched below.
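A simple (if limited) workaround, assuming your gRPC handlers can buffer incoming events in memory: run a small bounded pipeline over each batch with beam.CreateList. This is only a sketch; the Event type, extractKind function, and ProcessBatch below are hypothetical stand-ins for your protobuf type and processing logic, not Beam APIs:

```go
package eventproc

import (
	"context"
	"reflect"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/runners/direct"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/transforms/stats"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/x/debug"
)

// Event is a hypothetical stand-in for your generated protobuf message type.
type Event struct {
	Source string
	Kind   string
}

func extractKind(e Event) string { return e.Kind }

func init() {
	// Registration lets Beam encode the type and look up the DoFn by name.
	beam.RegisterType(reflect.TypeOf((*Event)(nil)).Elem())
	beam.RegisterFunction(extractKind)
}

// ProcessBatch runs one bounded pipeline over a batch of events that your
// gRPC handlers have buffered. beam.Init() must have been called once at
// program startup before this is invoked.
func ProcessBatch(ctx context.Context, batch []Event) error {
	p := beam.NewPipeline()
	s := p.Root()

	// Turn the in-memory batch into a bounded PCollection.
	events := beam.CreateList(s, batch)

	// Count how many events of each kind the batch contains, then print.
	kinds := beam.ParDo(s, extractKind, events)
	debug.Print(s, stats.Count(s, kinds))

	_, err := direct.Execute(ctx, p)
	return err
}
```

This gives you micro-batch semantics rather than true streaming; for a proper unbounded source you'd have to implement your own splittable DoFn, which is considerably more involved.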
Regarding processing data in memory, I'm not sure I fully understand what you mean. It will mostly depend on the data processing engine you use, since in the end your Beam pipeline will be translated into, for example, a Spark or Flink pipeline (if you use the SparkRunner or FlinkRunner, respectively) before actually running, and then that engine will manage the pipeline's workflow. Most modern engines do their best to keep all processed data in memory and flush it to disk only as a last resort.
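To illustrate how the runner stays out of your pipeline code in the Go SDK: if you use the beamx helper as your entry point, the engine is picked with a command-line flag at launch time:

```go
package main

import (
	"context"
	"flag"
	"log"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/x/beamx"
)

func main() {
	flag.Parse()
	beam.Init()

	p := beam.NewPipeline()
	s := p.Root()
	beam.Create(s, "hello", "beam") // placeholder for your real transforms

	// beamx.Run dispatches to whatever engine the --runner flag names,
	// e.g. --runner=direct, --runner=flink, --runner=spark, --runner=dataflow.
	if err := beamx.Run(context.Background(), p); err != nil {
		log.Fatal(err)
	}
}
```

So the same binary can target different engines without code changes, and it's that engine, not Beam itself, that decides how much of your data stays in memory.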
Upvotes: 1