Say I have the following DataFrames (suppose they are Dask DataFrames):
df A =
1
1
2
2
2
2
3
4
5
5
5
5
5
5
df B =
1
2
2
3
3
3
4
5
5
5
and I would like to merge the two so that the resulting DataFrame keeps the most information from either one (so, for instance, for observation 1 I would like to preserve the info from df A, for observation 3 I would like to preserve the info from df B, and so on). In other words, the resulting DataFrame should look like this:
df C=
1
1
2
2
2
2
3
3
3
4
5
5
5
5
5
5
Is there a way to do that in Dask?
Thank you
Notes:
There are various ways to merge Dask dataframes. Dask provides several built-in functions for this, such as dask.dataframe.DataFrame.join, dask.dataframe.multi.concat, dask.dataframe.DataFrame.merge, dask.dataframe.multi.merge, and dask.dataframe.multi.merge_asof. Depending on one's requirements, one might want to use a specific one.
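For illustration only, here is a minimal sketch of one of these calls (dask.dataframe.multi.concat, reachable as dd.concat once dask.dataframe is imported as dd; the frames and the column name x are made up purely for this example):
import pandas as pd
import dask.dataframe as dd

# Two tiny illustrative frames wrapped as Dask DataFrames
left = dd.from_pandas(pd.DataFrame({'x': [1, 2]}), npartitions=1)
right = dd.from_pandas(pd.DataFrame({'x': [3, 4]}), npartitions=1)

# concat stacks the rows of both frames; compute() materialises the result
print(dd.concat([left, right]).compute())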
This thread has really valuable information on merges. Even though its focus is on Pandas, it will allow one to understand left, right, outer, and inner merges.
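As a quick, hedged illustration of those merge types in plain Pandas (the frames, the key column, and the values here are made up purely for demonstration):
import pandas as pd

a = pd.DataFrame({'key': [1, 2, 3], 'val_a': ['a1', 'a2', 'a3']})
b = pd.DataFrame({'key': [2, 3, 4], 'val_b': ['b2', 'b3', 'b4']})

# inner keeps only keys present in both frames; outer keeps all keys;
# left/right keep every key of the respective side
print(pd.merge(a, b, on='key', how='inner'))
print(pd.merge(a, b, on='key', how='outer'))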
If one wants to do it with Pandas dataframes, there are various ways to achieve that. One approach would be creating a dataframe that records which of the two dataframes has the highest number of rows per sample_id, and then applying a custom-made function. Let's invest a bit more time in that approach.
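Before diving in, here is the setup the snippets below assume (a sketch only: the single column from the question is stored under the name sample_id, which is an assumed column name):
import pandas as pd

# df_a and df_b rebuilt from the values shown in the question,
# with the single column named sample_id (assumed name)
df_a = pd.DataFrame({'sample_id': [1, 1, 2, 2, 2, 2, 3, 4, 5, 5, 5, 5, 5, 5]})
df_b = pd.DataFrame({'sample_id': [1, 2, 2, 3, 3, 3, 4, 5, 5, 5]})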
We will first create a dataframe that stores the number of rows each dataframe has per sample_id, as follows:
df_count = pd.DataFrame({'sample_id': df_a['sample_id'].unique()})
# Count the rows per sample_id in each dataframe and map the counts onto df_count
df_count['df_a'] = df_count['sample_id'].map(df_a.groupby('sample_id').size())
df_count['df_b'] = df_count['sample_id'].map(df_b.groupby('sample_id').size())
As it will be helpful, let us create a column df_max that stores, for each sample_id, which dataframe has more rows:
df_count['df_max'] = df_count[['df_a', 'df_b']].idxmax(axis=1)
[Out]:
sample_id df_a df_b df_max
0 1 2 1 df_a
1 2 4 2 df_a
2 3 1 3 df_b
3 4 1 1 df_a
4 5 6 3 df_a
A one-liner to create the desired df_count would look like the following
df_count = pd.DataFrame({'sample_id': df_a['sample_id'].unique()}).assign(df_a=lambda x: x['sample_id'].map(df_a.groupby('sample_id').size()), df_b=lambda x: x['sample_id'].map(df_b.groupby('sample_id').size()), df_max=lambda x: x[['df_a', 'df_b']].idxmax(axis=1))
Now, given df_a, df_b, and df_count, one will want a function to merge the dataframes based on a specific condition:
If df_max is df_a, then take the rows from df_a.
If df_max is df_b, then take the rows from df_b.
One can create a function merge_df that takes df_a, df_b, and df_count and returns the merged dataframe:
def merge_df(df_a, df_b, df_count):
    # Create a list to store the dataframes
    df_list = []
    # Iterate over the rows in df_count
    for index, row in df_count.iterrows():
        # If df_max is df_a, then take the rows from df_a
        if row['df_max'] == 'df_a':
            df_list.append(df_a[df_a['sample_id'] == row['sample_id']])
        # If df_max is df_b, then take the rows from df_b
        elif row['df_max'] == 'df_b':
            df_list.append(df_b[df_b['sample_id'] == row['sample_id']])
        # Otherwise fall back to df_a
        else:
            df_list.append(df_a[df_a['sample_id'] == row['sample_id']])
    # Concatenate the dataframes in df_list and return the result, resetting the index
    return pd.concat(df_list).reset_index(drop=True)
Then one can apply the function
df_merged = merge_df(df_a, df_b, df_count)
[Out]:
sample_id
0 1
1 1
2 2
3 2
4 2
5 2
6 3
7 3
8 3
9 4
10 5
11 5
12 5
13 5
14 5
15 5
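Since the original question asked about Dask, here is a hedged sketch of how the same counting idea could be carried over to Dask DataFrames (assuming df_a and df_b are the Pandas frames built above; only dd.from_pandas, groupby().size(), boolean filtering, and dd.concat are used):
import dask.dataframe as dd

# Wrap the Pandas frames as Dask DataFrames
ddf_a = dd.from_pandas(df_a, npartitions=1)
ddf_b = dd.from_pandas(df_b, npartitions=1)

# Row counts per sample_id; compute() returns small Pandas Series
count_a = ddf_a.groupby('sample_id').size().compute()
count_b = ddf_b.groupby('sample_id').size().compute()

# For each sample_id, keep the rows from whichever frame has more of them
parts = []
for sid in sorted(set(count_a.index) | set(count_b.index)):
    if count_a.get(sid, 0) >= count_b.get(sid, 0):
        parts.append(ddf_a[ddf_a['sample_id'] == sid])
    else:
        parts.append(ddf_b[ddf_b['sample_id'] == sid])

df_merged = dd.concat(parts).compute().reset_index(drop=True)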