Michael Sherman

Reputation: 523

Generating data with Google Dataflow

Let's say I want to generate 100 trillion pieces of data (random numbers to keep it simple), and I'd like to use Google Dataflow to do it.

I can think of a dumb way to do this (I'm not 100% sure it would work, but it's where I'd start): take a text file that's 10 million lines long and, for every line, have a DoFn loop for 10 million iterations, outputting one randomly generated number per iteration. All of the numbers would eventually be written to a text file, and whatever is in the original text file would just be ignored.

But I can't help but think there might be a better, less-hacky way to generate data using Dataflow. Any suggestions on a better way to do this?

Thank you!

Upvotes: 1

Views: 274

Answers (2)

Dzmitry Lazerka

Reputation: 1925

Easy: just extend the Source class with your own number generator. See https://cloud.google.com/dataflow/model/custom-io
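
To make the idea concrete without reproducing the SDK's signatures (see the linked page for those), here is a minimal plain-Java sketch of the part a generator source has to get right: describing the work as index ranges that can be split so that each piece is generated independently. RangeSplitter, Range, and split are hypothetical names for illustration, not Dataflow classes.

```java
import java.util.ArrayList;
import java.util.List;

final class RangeSplitter {
    /** Half-open range [start, end) of record indices that one reader would generate. */
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }

        long size() {
            return end - start;
        }
    }

    /** Splits [0, total) into `parts` contiguous ranges whose sizes differ by at most one. */
    static List<Range> split(long total, int parts) {
        List<Range> ranges = new ArrayList<>(parts);
        long base = total / parts;
        long extra = total % parts; // the first `extra` ranges get one additional record
        long start = 0;
        for (int i = 0; i < parts; i++) {
            long size = base + (i < extra ? 1 : 0);
            ranges.add(new Range(start, start + size));
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 100 trillion records split into 4 pieces, purely for illustration.
        for (Range r : split(100_000_000_000_000L, 4)) {
            System.out.println("[" + r.start + ", " + r.end + ")");
        }
    }
}
```

In a real custom source, each range would back one sub-source, and its reader would emit one generated number per index in the range.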

Upvotes: 1

Zhou Yunqing

Reputation: 444

For a small dataset, you can just use pipeline.apply(Create.of(...)) to generate it, but that won't scale, because the generation code is executed locally.

A better way may be:

List<Integer> l = ...; // 100k integers inside
pipeline.apply(Create.of(l))                   // small input list, built locally
    .apply(ParDo.of(new Generate100MDoFn()))   // each input integer expands into 100M values on the workers
    .apply(TextIO.Write.to(...));

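This way Dataflow generates the data in parallel, spread evenly across the workers. Note that 100k input integers at 100M values each yields 10 trillion values, so scale either factor by 10 to reach the 100 trillion in the question.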
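
The body of Generate100MDoFn isn't shown above. As a rough sketch of what its per-element logic might look like, here is the loop in plain Java with no Dataflow dependency. SeededGenerator and generate are hypothetical names, and using the input integer as the RNG seed is my assumption: it keeps each element's stream reproducible and distinct from the others.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SplittableRandom;
import java.util.function.LongConsumer;

final class SeededGenerator {
    /**
     * Emits `count` pseudo-random longs for one input element.
     * The element's value is used as the RNG seed (an assumption of this sketch),
     * so reprocessing the same element reproduces the same numbers.
     */
    static void generate(int seed, long count, LongConsumer out) {
        SplittableRandom rng = new SplittableRandom(seed);
        for (long i = 0; i < count; i++) {
            out.accept(rng.nextLong());
        }
    }

    public static void main(String[] args) {
        // Tiny demonstration: 3 input elements, 5 numbers each.
        for (int seed = 0; seed < 3; seed++) {
            List<Long> values = new ArrayList<>();
            generate(seed, 5, values::add);
            System.out.println("seed " + seed + " -> " + values);
        }
    }
}
```

Inside an actual DoFn, the consumer would hand each value to the output of the processing context instead of collecting it into a list, and count would be 100M.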

Upvotes: 1
