Pablo
Pablo

Reputation: 11031

How do I Filter elements of a PCollection with a ParDo with Apache Beam Python SDK

I have a PCollection, and I would like to use a ParDo to filter out some elements from it.

Is there a place where I can find an example for this?

Upvotes: 5

Views: 8234

Answers (1)

Pablo
Pablo

Reputation: 11031

In the Apache Beam Python SDK, there is a Filter transform that receives a lambda, and filters out all elements that return False. Here is an example:

filtered_collection = (beam.Create([1, 2, 3, 4, 5])
                       beam.Filter(lambda x: x % 2 == 0))

In this case, filtered_collection will be a PCollection that contains 2, and 4.


If you want to code this as a DoFn that is passed to a ParDo transform, you would do something like this:

class FilteringDoFn(beam.DoFn):
  def process(self, element):
    if element % 2 == 0:
      yield element
    else:
      return  # Return nothing

and you can apply it like so:

filtered_collection = (beam.Create([1, 2, 3, 4, 5])
                       beam.ParDo(FilteringDoFn()))

where, like before, filtered_collection is a PCollection that contains 2, and 4.

Upvotes: 13

Related Questions