Reputation: 61
I am new to PyFlink. I have done the official training exercise in Java: https://github.com/apache/flink-training
However, the project I am working on must use Python as a programming language. I want to know if it is possible to write a data generator using the "SourceFunction". In older PyFlink versions this was possible, using Jython: https://nightlies.apache.org/flink/flink-docs-release-1.7/dev/stream/python.html#streaming-program-example
In newer examples the data stream contains only a finite set of data, which is never extended. I have not found any example of a data generator in PyFlink comparable to the Java one, e.g. https://github.com/apache/flink-training/blob/master/common/src/main/java/org/apache/flink/training/exercises/common/sources/TaxiRideGenerator.java
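For reference, this is roughly the finite-data pattern I mean; a minimal sketch based on the tutorial (the names and values are just placeholders I chose):

from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
# the stream is built from a fixed, in-memory collection and is never extended
ds = env.from_collection(
    collection=[(1, 'aaa'), (2, 'bbb')],
    type_info=Types.ROW([Types.INT(), Types.STRING()]))
ds.print()
env.execute("finite_example")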
I am not sure what functionality the SourceFunction and SinkFunction interfaces provide in PyFlink. Can they be used directly from Python, or only in combination with other pipelines or jar files? It looks like the methods "run()" and "cancel()" are not implemented, so they cannot be used like other classes by simply overriding them.
If they cannot be used directly in Python, are there other ways to use them? A simple example would be appreciated.
If that is not possible, are there other ways to write a data generator in an OOP style? Take this example: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/python/datastream_tutorial/ There the split() method is used to separate the stream. Basically, I want to do this with an extra class that keeps extending the stream, which the Java TaxiRide example does via "ctx.collect()". I am trying to avoid Java, another framework for the pipeline, and Jython. A short example would be nice, but I appreciate any tips and advice.
I tried to use SourceFunction directly, but as already mentioned, I think this is the wrong approach; it results in the error AttributeError: 'DataGenerator' object has no attribute '_get_object_id':
from pyflink.datastream import SourceFunction


class DataGenerator(SourceFunction):

    def __init__(self):
        super().__init__(self)
        self._num_iters = 1000
        self._running = True

    def run(self, ctx):
        counter = 0
        while self._running and counter < self._num_iters:
            ctx.collect('Hello World')
            counter += 1

    def cancel(self):
        self._running = False
Upvotes: 3
Views: 1277
Reputation: 61
Solution: After looking at some older code that uses the SourceFunction and SinkFunction classes, I came to a solution. In that code a Kafka connector written in Java is used. The Python code below can be taken as an example of how to use PyFlink's SourceFunction and SinkFunction.
I have only written an example for the SourceFunction:
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream import SourceFunction
from pyflink.java_gateway import get_gateway


class TaxiRideGenerator(SourceFunction):

    def __init__(self):
        # get the Java TaxiRideGenerator class through the Py4J gateway ...
        java_src_class = get_gateway().jvm \
            .org.apache.flink.training.exercises.common.sources.TaxiRideGenerator
        # ... instantiate it and hand the Java object to the SourceFunction wrapper
        java_src_obj = java_src_class()
        super(TaxiRideGenerator, self).__init__(java_src_obj)


def show(ds, env):
    # this is just a little helper to show the output of the pipeline
    ds.print()
    env.execute()


def streaming():
    # set up the Flink execution environment
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(1)

    taxi_src = TaxiRideGenerator()
    ds = env.add_source(taxi_src)

    show(ds, env)


if __name__ == "__main__":
    streaming()
The second line in the class __init__ was hard to find: I had expected the first line to already return an object, but the gateway call only returns the Java class, which still has to be instantiated before it is passed to the wrapper.
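The same wrapping pattern should also work for a sink. A hedged sketch (untested; it assumes Flink's Java PrintSinkFunction is already on the classpath and that SinkFunction lives in pyflink.datastream.functions in your PyFlink version):

from pyflink.datastream.functions import SinkFunction
from pyflink.java_gateway import get_gateway


class PrintSink(SinkFunction):

    def __init__(self):
        # wrap the Java PrintSinkFunction exactly like the source above
        j_sink_class = get_gateway().jvm \
            .org.apache.flink.streaming.api.functions.sink.PrintSinkFunction
        j_sink_obj = j_sink_class()
        super(PrintSink, self).__init__(j_sink_obj)

You would then call ds.add_sink(PrintSink()) instead of ds.print().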
You have to create a jar file after building the flink-training project. I changed into the build output directory until I could see the folder "org":
$ cd flink-training/flink-training/common/build/classes/java/main
flink-training/common/build/classes/java/main$ ls
org
flink-training/common/build/classes/java/main$ jar cvf flink-training.jar org/apache/flink/training/exercises/common/**/*.class
Copy the jar file to the pyflink/lib folder, normally under your python environment, e.g. flinkenv/lib/python3.8/site-packages/pyflink/lib. Then start the script.
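Alternatively, depending on your Flink version, it should also be possible to register the jar from the script via add_jars instead of copying it into pyflink/lib; a small sketch (adjust the file URL to wherever you built the jar):

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
# make the Java TaxiRideGenerator classes available to this job only
env.add_jars("file:///path/to/flink-training.jar")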
Upvotes: 3