Python Apache Beam Multiple Outputs & Processing

Question

I am trying to run a job on Google Dataflow with the following process flow:

Essentially taking a single datasource, filtering based on certain values within the dictionary and create separate outputs for each filter criteria.

I've written the following code:

# List of values to filter by
x_list = [1, 2, 3]

with beam.Pipeline(options=PipelineOptions().from_dictionary(pipeline_params)) as p:
    # Read in newline JSON data - each line is a dictionary
    log_data = (
        p 
        | "Create " + input_file >> beam.io.textio.ReadFromText(input_file)
        | "Load " + input_file >> beam.FlatMap(lambda x: json.loads(x))
    )
    
    # For each value in x_list, filter log_data for dictionaries containing the value & write out to separate file
    for i in x_list:
        # Return dictionary if given key = value in filter
        filtered_log = log_data | "Filter_"+i >> beam.Filter(lambda x: x['key'] == i)
        # Do additional processing
        processed_log = process_pcoll(filtered_log, event)
        # Write final file
        output = (
            processed_log
            | 'Dump_json_'+filename >> beam.Map(json.dumps)
            | "Save_"+filename >> beam.io.WriteToText(output_fp+filename,num_shards=0,shard_name_template="")
        )

Currently it only processes the first value in the list. I know that I probably have to use ParDo, but I'm not very sure how to factor that into my process.

Python Apache Beam Multiple Outputs & Processing

Answers (1)

Related Questions

Python Apache Beam Multiple Outputs &amp; Processing

Answers (1)

Related Questions

Python Apache Beam Multiple Outputs & Processing