Xaphanius

Reputation: 619

How to retrieve the content of a PCollection and assign it to a normal variable?

I am using Apache-Beam with the Python SDK.

Currently, my pipeline reads multiple files, parses them, and generates pandas DataFrames from their data. Then it combines them into a single DataFrame.

What I want now is to retrieve this single large DataFrame and assign it to a normal Python variable.

Is this possible?

Upvotes: 2

Views: 2760

Answers (1)

jkff

Reputation: 17913

A PCollection is simply a logical node in the execution graph, and its contents are not necessarily stored anywhere, so this is not possible directly.

However, you can ask your pipeline to write the PCollection to a file (e.g. convert elements to strings and use WriteToText with num_shards=1), run the pipeline and wait for it to finish, and then read that file from your main program.
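A minimal sketch of that approach might look like the following. The function names write_records and read_records, the JSON-lines encoding, and the ".jsonl" suffix are assumptions made for illustration, and beam.Create stands in for the real reading and parsing steps. Treat it as one possible shape, not the definitive way to do it.

```python
import glob
import json


def write_records(records, output_prefix):
    """Run a Beam pipeline that writes each record as one JSON line."""
    # Imported here so that read_records() below also works without Beam.
    import apache_beam as beam

    # Leaving the `with` block runs the pipeline and waits for it to finish.
    with beam.Pipeline() as pipeline:
        (
            pipeline
            # Hypothetical stand-in for the real read/parse transforms.
            | "Create" >> beam.Create(records)
            # Convert each element to a string before writing.
            | "ToJson" >> beam.Map(json.dumps)
            | "Write" >> beam.io.WriteToText(
                output_prefix, file_name_suffix=".jsonl", num_shards=1)
        )


def read_records(output_prefix):
    """Read the written JSON lines back into an ordinary Python list."""
    records = []
    # WriteToText inserts a shard name between the prefix and the suffix.
    for path in sorted(glob.glob(output_prefix + "*.jsonl")):
        with open(path) as handle:
            records.extend(json.loads(line) for line in handle if line.strip())
    return records
```

Calling write_records(rows, prefix) followed by read_records(prefix) leaves the data in a plain Python list in the main program. If each record is a dict representing one row, passing that list to pandas.DataFrame should rebuild a single DataFrame. Note that this only suits results small enough to fit in the memory of the machine running the main program.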

Upvotes: 5
