Gopinath S

Reputation: 121

How to write a list object to a JSON file using Apache Beam?

I have a list of dictionary elements as shown below.

list_data = [
    {"id":"1", "name":"Cow", "type": "animal"},
    {"id":"2", "name":"Lion", "type": "animal"},
    {"id":"3", "name":"Peacock", "type": "bird"},
    {"id":"4", "name":"Giraffe", "type": "animal"}
]

I want to write the above list to a JSON file using an Apache Beam pipeline.

This is what I tried:

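import apache_beam as beam

class BeamProcess:

    # 'self' is required because the method is called on an instance below.
    def process_data(self):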

        json_file_path = "gs://my_bucket/df_output/output.json"
        
        list_data = [
            {"id":"1", "name":"Cow", "type": "animal"},
            {"id":"2", "name":"Lion", "type": "animal"},
            {"id":"3", "name":"Peacock", "type": "bird"},
            {"id":"4", "name":"Giraffe", "type": "animal"}
        ]

        argv = [
                '--project=<my_project>',
                '--region=<region>',
                '--job_name=<custom_name>',
                '--temp_location=<temporary_location>',
                '--runner=DataflowRunner'
            ]

        p = beam.Pipeline(argv=argv)

        (
                p
                | 'Create' >> beam.Create(list_data)
                | 'Write Output' >> beam.io.WriteToText(json_file_path, shard_name_template='')
        )
        p.run().wait_until_finish()


if __name__ == "__main__":
    beam_proc = BeamProcess()
    beam_proc.process_data()

When I run the code above, output.json contains the following lines:

{"id":"1", "name":"Cow", "type": "animal"}
{"id":"2", "name":"Lion", "type": "animal"}
{"id":"3", "name":"Peacock", "type": "bird"}
{"id":"4", "name":"Giraffe", "type": "animal"}

But what I want to see is:

[
    {"id":"1", "name":"Cow", "type": "animal"},
    {"id":"2", "name":"Lion", "type": "animal"},
    {"id":"3", "name":"Peacock", "type": "bird"},
    {"id":"4", "name":"Giraffe", "type": "animal"}
]

What is the right way to write a Python list object as a JSON file using Apache Beam?

Upvotes: 1

Views: 1220

Answers (1)

Daniel Oliveira

Reputation: 1431

When beam.Create is given a list, it interprets it as a list of elements for the resulting PCollection. When you write out your PCollection to text, you're outputting four individual elements instead of a list, which is why it isn't formatted as you expect.

beam.Create([1, 2, 3, 4]) # Creates a PCollection of four int elements.

So to create a PCollection that contains the whole list as a single element, nest the list inside another list, like so:

beam.Create([[1, 2, 3, 4]]) # Creates a PCollection of one list element.
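
Applied to the question's data, this could look roughly like the sketch below. It is not a tested Dataflow job, just an illustration under a few assumptions: to_json_array and run are hypothetical names I made up, the pipeline uses the default local runner (pass your own options for Dataflow), and the extra Map step is there because, as far as I know, WriteToText writes a non-string element using its Python string form rather than JSON.

```python
import json


def to_json_array(records):
    """Serialize a whole list of records as one JSON array, one record per line."""
    body = ",\n".join("    " + json.dumps(record) for record in records)
    return "[\n" + body + "\n]"


def run(list_data, output_path):
    # Imported here so the helper above can be used without Beam installed.
    import apache_beam as beam

    # Default (local) runner; pass pipeline options here to run on Dataflow.
    with beam.Pipeline() as p:
        (
            p
            # Nest the list so the PCollection has ONE element: the entire list.
            | 'Create' >> beam.Create([list_data])
            # Turn that single list element into a single JSON array string.
            | 'To JSON' >> beam.Map(to_json_array)
            | 'Write Output' >> beam.io.WriteToText(output_path, shard_name_template='')
        )
```

Calling run(list_data, "gs://my_bucket/df_output/output.json") should then write one JSON array instead of four separate lines. If the records already live in a PCollection rather than a Python list, beam.combiners.ToList() can gather them into a single list element before the 'To JSON' step, though the order of the records is then not guaranteed.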

Upvotes: 3
