Apache Beam: compare two datasets with only one column

Question

My task is to compare two datasets in Apache Beam on the Dataflow runner and produce three outputs: the elements common to both datasets, the elements only in dataset 1, and the elements only in dataset 2.

I tried using CoGroupByKey, but I am not sure it can be used here, because each dataset is only a one-dimensional list. How can we compare them? What I tried is merging the two PCollections as shown below:

import apache_beam as beam


with beam.Pipeline() as p:

  dataset1 = ['data1a', 'data1b', 'data1c', 'data1d']
  dataset2 = ['data2a', 'data2b', 'data1b', 'data2d']
  dataset1_pcoll = p | 'Read Dataset 1' >> beam.Create(dataset1)
  dataset2_pcoll = p | 'Read Dataset 2' >> beam.Create(dataset2)
  

  combined_data = (
      {
          'file1': dataset1_pcoll,
          'file2': dataset2_pcoll,
      }
      | 'CoGroup Files' >> beam.CoGroupByKey()
  )

Running this fails with the following error:

 wrapper = lambda x: [fn(*x)]
TypeError: <lambda>() takes 2 positional arguments but 6 were given [while running 'CoGroup Files/CoGroupByKeyImpl/Tag[file2]']

