Python alternative to Java DataFlow Streaming code

Question

I have the following java code snippet which polls the gcs bucket for the arrival of new files. This is the code that I am using for my streaming pipeline, which would further load the data after applying some transformations into some destination.

PCollection pcollection = pipeline.apply("Read From streaming source",
                TextIO.read().from("gs://abc/xyz")
                        .watchForNewFiles(Duration.standardSeconds(10), Watch.Growth.never()));

but for a particular use case I need to achieve the same thing using python, for which I am unable to find the required libraries, also all the implementations for the streaming pipelines in python are for PubSub. Running the python pipeline with streaming=true doesn't solve any purpose as the code exits after completion and doesn't wait for new files. Can someone suggest a way to proceed? Thanks in advance.

I&#241;igo · Accepted Answer

You need MatchContinuously (doc), which was added in this Pull Request.

Edit:

Given that I saw you had some issues with the PTransform (I think you deleted the comments), I am going to add a sample code:

(p | MatchContinuously("gs://apache-beam-samples/shakespeare/*", interval=10.0)
   | Map(lambda x: x.path)
   | ReadAllFromText()
   | Map(lambda x: logging.info(x))
)

Python alternative to Java DataFlow Streaming code

Answers (1)

Related Questions