Reputation: 1651
I am using Dataflow and Apache Beam to process a dataset and store the result in a headerless csv file with two columns, something like this:
A1,a
A2,a
A3,b
A4,a
A5,c
...
I want to filter out certain entries based on the following two conditions:
1- In the second column, if the number of occurrences of a certain value is less than N
, then remove all such rows. For instance if N=10
and c
only appears 7 times, then I want all those rows to be filtered out.
2- In the second column, if the number of occurrences of a certain value is more than M
, then only keep M
many of such rows and filter out the rest. For instance if M=1000
and a
appears 1200 times, then I want 200 of such entries to be filtered out, and the other 1000 cases to be stored in the csv file.
In other words, I want to make sure all elements of the second columns appear more than N
and less than M
many times.
My question is whether this is possible by using some filter in Beam? Or should it be done as a post-process step once the csv file is created and saved?
Upvotes: 1
Views: 1587
Reputation: 1383
You can use beam.Filter
to filter out all the second column values that matches your range's lower bound condition into a PCollection.
Then correlate that PCollection (as a side input) with your original PCollection to filter out all the lines that need to be excluded.
As for the upperbound, since you want to keep any upperbound amount of elements instead of excluding them completely, you should do some post processing or come up with some combine transforms to do that.
An example with Python SDK using word count.
class ReadWordsFromText(beam.PTransform):
def __init__(self, file_pattern):
self._file_pattern = file_pattern
def expand(self, pcoll):
return (pcoll.pipeline
| beam.io.ReadFromText(self._file_pattern)
| beam.FlatMap(lambda line: re.findall(r'[\w\']+', line.strip(), re.UNICODE)))
p = beam.Pipeline()
words = (p
| 'read' >> ReadWordsFromText('gs://apache-beam-samples/shakespeare/kinglear.txt')
| "lower" >> beam.Map(lambda word: word.lower()))
import random
# Assume this is the data PCollection you want to do filter on.
data = words | beam.Map(lambda word: (word, random.randint(1, 101)))
counts = (words
| 'count' >> beam.combiners.Count.PerElement())
words_with_counts_bigger_than_100 = counts | beam.Filter(lambda count: count[1] > 100) | beam.Map(lambda count: count[0])
Now you get a pcollection like
def cross_join(left, rights):
for x in rights:
if left[0] == x:
yield (left, x)
data_with_word_counts_bigger_than_100 = data | beam.FlatMap(cross_join, rights=beam.pvalue.AsIter(words_with_counts_bigger_than_100))
Now you filtered out elements below lowerbound from the data set and get
Note the 66
from ('king', 66)
is the fake random data I put in.
To debug with such visualizations, you can use interactive beam. You can setup your own notebook runtime following instructions; Or you can use hosted solutions provided by Google Dataflow Notebooks.
Upvotes: 2