Reputation: 189
I have the following java code snippet which polls the gcs bucket for the arrival of new files. This is the code that I am using for my streaming pipeline, which would further load the data after applying some transformations into some destination.
PCollection<String> pcollection = pipeline.apply("Read From streaming source",
TextIO.read().from("gs://abc/xyz")
.watchForNewFiles(Duration.standardSeconds(10), Watch.Growth.never()));
but for a particular use case I need to achieve the same thing using python, for which I am unable to find the required libraries, also all the implementations for the streaming pipelines in python are for PubSub. Running the python pipeline with streaming=true
doesn't solve any purpose as the code exits after completion and doesn't wait for new files. Can someone suggest a way to proceed?
Thanks in advance.
Upvotes: 0
Views: 165
Reputation: 2670
You need MatchContinuously
(doc), which was added in this Pull Request.
Edit:
Given that I saw you had some issues with the PTransform (I think you deleted the comments), I am going to add a sample code:
(p | MatchContinuously("gs://apache-beam-samples/shakespeare/*", interval=10.0)
| Map(lambda x: x.path)
| ReadAllFromText()
| Map(lambda x: logging.info(x))
)
Upvotes: 1