Reputation: 619
I am using Apache Beam with the Python SDK.
Currently, my pipeline reads multiple files, parses them, and generates pandas DataFrames from their data. Then it groups them into a single DataFrame.
What I want now is to retrieve this single fat dataframe, assigning it to a normal Python variable.
Is this possible?
Upvotes: 2
Views: 2760
Reputation: 17913
A PCollection is simply a logical node in the execution graph, and its contents are not necessarily stored anywhere, so this is not possible directly.
However, you can ask your pipeline to write the PCollection to a file (e.g. convert the elements to strings and use WriteToText with num_shards=1), run the pipeline and wait for it to finish, and then read that file from your main program.
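A minimal sketch of that approach, assuming each element is a plain dict (one row) rather than a whole DataFrame, and that the rows are serialized as one JSON object per line. The function names, the `/tmp/beam_rows` prefix, and the `beam.Create` step are hypothetical stand-ins for your real reading and parsing steps:

```python
import glob
import json


def run_pipeline_to_file(rows, output_prefix):
    """Run a small Beam pipeline that writes one JSON object per line."""
    # Imported lazily so read_rows() below can be used without Beam installed.
    import apache_beam as beam

    # Leaving the `with` block runs the pipeline and waits for it to finish.
    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "CreateRows" >> beam.Create(rows)  # stand-in for the real parsing steps
            | "ToJson" >> beam.Map(json.dumps)   # convert each element to a string
            | "Write" >> beam.io.WriteToText(
                output_prefix, file_name_suffix=".jsonl", num_shards=1
            )
        )


def read_rows(output_prefix):
    """Read back every line written under output_prefix as a dict."""
    rows = []
    # WriteToText appends a shard suffix to the prefix, so match with a glob.
    for path in sorted(glob.glob(output_prefix + "*.jsonl")):
        with open(path) as handle:
            rows.extend(json.loads(line) for line in handle if line.strip())
    return rows


if __name__ == "__main__":
    import pandas as pd

    prefix = "/tmp/beam_rows"  # hypothetical output location
    run_pipeline_to_file([{"a": 1, "b": 2}, {"a": 3, "b": 4}], prefix)
    df = pd.DataFrame(read_rows(prefix))  # now an ordinary Python variable
    print(df)
```

Note that the order of lines in the output file is not guaranteed, so sort the resulting DataFrame if row order matters.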
Upvotes: 5