Reputation: 523
Let's say I want to generate 100 trillion pieces of data (random numbers to keep it simple), and I'd like to use Google Dataflow to do it.
I can think of a dumb way to do this (I'm not 100% sure it would work, but it's where I'd start): take a text file that's 10 million lines long and, for every line, have a DoFn loop for 10 million iterations, emitting one randomly generated number per iteration, with all the numbers eventually written to a text file. (The contents of the input file would just be ignored.)
But I can't help but think there might be a better, less-hacky way to generate data using Dataflow. Any suggestions on a better way to do this?
Thank you!
Upvotes: 1
Views: 274
Reputation: 1925
Easy, just extend the Source class with your own number generator: https://cloud.google.com/dataflow/model/custom-io
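One part of writing such a source is deciding how the total count gets divided into sub-ranges that workers can read independently. Here is a rough plain-Java sketch of just that splitting step. `RangeSplitSketch` and `split` are hypothetical names I made up for illustration, not part of the Dataflow SDK; a real source would implement the SDK interfaces described at the link above.

```java
import java.util.ArrayList;
import java.util.List;

class RangeSplitSketch {
    /**
     * Splits [0, total) into at most `parts` contiguous [start, end) ranges
     * of near-equal size. Hypothetical helper, not a Dataflow API.
     * Assumes parts > 0.
     */
    static List<long[]> split(long total, int parts) {
        List<long[]> ranges = new ArrayList<>();
        long base = total / parts;
        long extra = total % parts; // the first `extra` ranges get one more element
        long start = 0;
        for (int i = 0; i < parts; i++) {
            long size = base + (i < extra ? 1 : 0);
            if (size == 0) {
                break; // fewer elements than parts: stop once everything is assigned
            }
            ranges.add(new long[] {start, start + size});
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] r : split(10, 3)) {
            System.out.println("[" + r[0] + ", " + r[1] + ")");
        }
    }
}
```

Each range could then be handed to its own reader, which would emit one random number per index in the range.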
Upvotes: 1
Reputation: 444
For a small dataset, you can just use pipeline.apply(Create.of(...)) to generate it, but that won't scale (the generation code is executed locally).
A better way may be:
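List<Integer> seeds = ...; // 100k integers inside
pipeline
    .apply(Create.of(seeds))                  // small in-memory input, one element per seed
    .apply(ParDo.of(new Generate100MDoFn()))  // per its name, emits 100M numbers per input element
    .apply(TextIO.Write.to(...));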
This way Dataflow generates a large amount of data, spread evenly across workers in parallel.
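To make the fan-out idea concrete, here is a plain-Java sketch of what the per-element generation and the record count could look like. It runs locally and doesn't use the Dataflow SDK at all; `FanOutSketch`, `totalRecords`, and `generateForSeed` are hypothetical names, and seeding the generator from the input element is my own assumption, not something the snippet above specifies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class FanOutSketch {
    /** Total records produced when each of `seeds` elements emits `perSeed` values. */
    static long totalRecords(long seeds, long perSeed) {
        return Math.multiplyExact(seeds, perSeed); // throws on long overflow
    }

    /**
     * Stand-in for the loop inside a generating DoFn: emits `perSeed`
     * random values for one input element. Seeding by the element
     * (an assumption) keeps a retried element reproducible.
     */
    static List<Long> generateForSeed(int seed, int perSeed) {
        Random rng = new Random(seed);
        List<Long> out = new ArrayList<>(perSeed);
        for (int i = 0; i < perSeed; i++) {
            out.add(rng.nextLong());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(totalRecords(100_000L, 100_000_000L));
        System.out.println(generateForSeed(0, 3));
    }
}
```

Note that 100k seeds times 100M numbers each is 10 trillion records, so to reach the 100 trillion in the question you would need ten times as many seeds or ten times as many numbers per seed.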
Upvotes: 1