Reputation: 11
I am doing a POC for GCP Dataflow batch processing.
I want to pass a Pandas DataFrame as a batch input, perform a columnar transformation, and return the same batch again.
I referred to the MultiplyByTwo example provided at https://beam.apache.org/documentation/programming-guide/
When I input a Pandas DataFrame, the process_batch function is not executed.
Can you please let me know why, and if possible provide an example?
Code -

    pd = read_excel(path)
    result1 = (pd | "test batch" >> beam.ParDo(TestBatch(argv)))

DoFn class -

    class TestBatch(beam.DoFn):
        def __init__(self, args: Any):
            self.args = args
            print("TestBatch.__init__")

        def setup(self) -> None:
            print("TestBatch.setup")

        def process_batch(self, batch: pd.DataFrame) -> pd.DataFrame:
            print("TestBatch.process_batch")
            print(batch)
            yield batch
Upvotes: 0
Views: 263
Reputation: 407
I suggest you take a look at Beam DataFrames, e.g., https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_excel. This allows you to process the dataframes more efficiently.
For your case, this code example should work:
    import apache_beam as beam
    import pandas as pd

    df = pd.read_csv("beers.csv")

    class ProcessDf(beam.DoFn):
        def process(self, e: pd.DataFrame):
            print(e)
            yield e  # yield keeps the DataFrame as a single element

    with beam.Pipeline() as p:
        p | beam.Create([df]) | beam.ParDo(ProcessDf())
Note that a Beam pipeline has to start with a Pipeline object; in your snippet you pipe a plain DataFrame into the transform, so no pipeline ever runs and process_batch is never called.
https://beam.apache.org/get-started/ has more information. Tour of Beam is a great place to play with some basic concepts.
Upvotes: 0