Aman Vaishya
Aman Vaishya

Reputation: 189

Python alternative to Java DataFlow Streaming code

I have the following java code snippet which polls the gcs bucket for the arrival of new files. This is the code that I am using for my streaming pipeline, which would further load the data after applying some transformations into some destination.

PCollection<String> pcollection = pipeline.apply("Read From streaming source",
                TextIO.read().from("gs://abc/xyz")
                        .watchForNewFiles(Duration.standardSeconds(10), Watch.Growth.never()));

but for a particular use case I need to achieve the same thing using python, for which I am unable to find the required libraries, also all the implementations for the streaming pipelines in python are for PubSub. Running the python pipeline with streaming=true doesn't solve any purpose as the code exits after completion and doesn't wait for new files. Can someone suggest a way to proceed? Thanks in advance.

Upvotes: 0

Views: 165

Answers (1)

I&#241;igo
I&#241;igo

Reputation: 2670

You need MatchContinuously (doc), which was added in this Pull Request.

Edit:

Given that I saw you had some issues with the PTransform (I think you deleted the comments), I am going to add a sample code:

(p | MatchContinuously("gs://apache-beam-samples/shakespeare/*", interval=10.0)
   | Map(lambda x: x.path)
   | ReadAllFromText()
   | Map(lambda x: logging.info(x))
)

Upvotes: 1

Related Questions