Daljeet Singh
Daljeet Singh

Reputation: 714

apache_beam, read data from GCS buckets during pipeline

I have a pub/sub topic, which gets message as soon as a file is created in the bucket, with the streaming pipeline, I am able to get the object path. Created file is AVRO. Now in my pipeline I want to read all the content of the different files, that was captured in the pub/sub messages in the pvs step.

Any direction to the solution will be appriciated, using given functions in apache_beam.

Using python sdk

Upvotes: 1

Views: 384

Answers (1)

Daljeet Singh
Daljeet Singh

Reputation: 714

Was able to do this using ReadAllFromAvro() Ref: https://beam.apache.org/releases/pydoc/2.29.0/apache_beam.io.avroio.html?highlight=readallfromavro#apache_beam.io.avroio.ReadAllFromAvro

    with beam.Pipeline(options=self.pipeline_options) as pipeline:
        (
            pipeline
            | "Read from Pub/Sub" >> beam.io.MultipleReadFromPubSub(pubsub_subscribers, with_attributes=True)
            | "Windowing Transformation" >> GroupMessagesByFixedWindowsTransform(
            self.streamverse_options.window_size)
            | "Read From GCS" >> beam.io.ReadAllFromAvro()
            | beam.Map(print)
        )

Upvotes: 1

Related Questions