Reputation: 405
I have csv files that have missing values per groups formed by primary keys (for every group, there's only 1 value populated for 1 field, and I need that field to be populated for all records of the group). I'm processing the entire file with apache beam and therefore, I want to use GroupByKey
to fill up the field for each group, and then ungroup it to restore the original data, now with filled data. The equivalent in pandas would be:
dataframe[column_to_be_filled] = dataframe.groupby(primary_key)[column_to_be_filled].ffill().bfill()
I don't know how to achieve this with apache beam. I first used apache beam dataframe, but that'd take a lot of memory.
Upvotes: 1
Views: 1213
Reputation: 656
You can try to use exactly the same code as Pandas in Beam: https://beam.apache.org/documentation/dsls/dataframes/overview/
You can use read_csv
to read your data into a dataframe, and then apply the same code that you would use in Pandas. Not all Pandas operations are supported (https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas/), but that specific case with the group by key should work.
Upvotes: 1
Reputation: 964
It's better to process your elements with a pcollection instead of a dataframe to avoid memory issues.
First read your CSV as a pcollection and then you can use GroupByKey
and process the grouped elements and yield the results with a separate transformation.
It could be something like this
(pcollection | 'Group by key' >> beam.GroupByKey()
| 'Process grouped elements' >> beam.ParDo(UngroupElements()))
The input pcollection should be list of tuples each one contains the key you want to group with and the element.
And the ptransformation would look like this:
class UngroupElements(beam.ParDo):
def process(element):
k, v = element
for elem in list(v):
# process your element
yield elem
Upvotes: 3