Maximilian

Reputation: 8450

Sample in Dataflow / Beam with Python

I'm trying to get a sample of the items in a PCollection using the Python SDK on Dataflow / Beam.

While it's not documented, Sample.FixedSizeGlobally(n) exists.

When testing, it seems to return a PCollection with a single item: a list containing the samples, rather than a PCollection with the samples. Is that correct?

Is doing this the best way of turning that single-item PCollection into a PCollection of the items?

| Sample.FixedSizeGlobally(sample_size)
| beam.FlatMap(lambda x: x)

Upvotes: 2

Views: 1561

Answers (1)

Pablo

Reputation: 11021

Currently, yes. The Sample.FixedSizeGlobally() transform returns a PCollection with a single element: a list containing the samples. You can turn it into a PCollection of individual elements like you said:

| Sample.FixedSizeGlobally(sample_size)
| beam.FlatMap(lambda x: x)

We'll make sure to add a PCollection-to-PCollection transform, and we also welcome your contributions to Beam :) In the meantime, that's what we've got.

Upvotes: 4
