pa-nguyen
pa-nguyen

Reputation: 405

GroupByKey to fill values and then ungroup apache beam

I have csv files that have missing values per groups formed by primary keys (for every group, there's only 1 value populated for 1 field, and I need that field to be populated for all records of the group). I'm processing the entire file with apache beam and therefore, I want to use GroupByKey to fill up the field for each group, and then ungroup it to restore the original data, now with filled data. The equivalent in pandas would be:

dataframe[column_to_be_filled] = dataframe.groupby(primary_key)[column_to_be_filled].ffill().bfill()

I don't know how to achieve this with apache beam. I first used apache beam dataframe, but that'd take a lot of memory.

Upvotes: 1

Views: 1213

Answers (2)

Israel Herraiz
Israel Herraiz

Reputation: 656

You can try to use exactly the same code as Pandas in Beam: https://beam.apache.org/documentation/dsls/dataframes/overview/

You can use read_csv to read your data into a dataframe, and then apply the same code that you would use in Pandas. Not all Pandas operations are supported (https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas/), but that specific case with the group by key should work.

Upvotes: 1

Idhem
Idhem

Reputation: 964

It's better to process your elements with a pcollection instead of a dataframe to avoid memory issues.

First read your CSV as a pcollection and then you can use GroupByKey and process the grouped elements and yield the results with a separate transformation.

It could be something like this

(pcollection | 'Group by key' >> beam.GroupByKey()
                    | 'Process grouped elements' >> beam.ParDo(UngroupElements()))

The input pcollection should be list of tuples each one contains the key you want to group with and the element.

And the ptransformation would look like this:

class UngroupElements(beam.ParDo):
    
    def process(element):
        k, v = element
        for elem in list(v):
            # process your element 
            yield elem

Upvotes: 3

Related Questions