Nick_FN

Reputation: 11

Pandas - SUMIF equivalent on multiple keys across 2 dataframes

I'm trying to find a good method to combine two dataframes, doing what in Excel would be a SUMIF (or multiple SUMIFS).

I have the following two dataframes:

df = pd.DataFrame({ 'KEY_1': ['MF1020','MF1020','MF1540','MF5520','MF1525','MF6010'],
                    'KEY_2': ['MF0520','MF2020','MF2030','MF4520','MF1050','MF1020'],
                    'KEY_3': ['MF1020','MF2520','MF1020','MF2520','MF1020','MF1060'],
                    'KEY_COUNT1': [1,0.4,0,0,0.45,0.01],
                    'KEY_COUNT2': [0,0.3,0.9,0,0.05,0.01],
                    'KEY_COUNT3': [0,0.3,0.1,1,0.5,0.98],})


df1 = pd.DataFrame({'KEY': ['MF1020', 'MF0565', 'MF1540', 'MF5520', 'MF1525', 'MF6515', 'MF6010', 'MF0520','MF2020', 'MF2030', 'MF4520', 'MF1050', 'MF2520', 'MF1060'],
                    'KEY_COUNT': [0,0,0,0,0,0,0,0,0,0,0,0,0,0]})

In the first dataframe KEY_1 is linked to KEY_COUNT1 (likewise for 2, 3). In the second dataframe the KEY column contains all of the keys in the first dataframe across all three KEY columns, but also additional keys where the output should remain zero. There are many instances in the first dataframe where keys are repeated.

I want to sum the count columns where the key is equal to the key in the second dataframe.

I'm expecting the output to look as follows:

df_final = pd.DataFrame({   'KEY': ['MF1020', 'MF0565', 'MF1540', 'MF5520', 'MF1525', 'MF6515', 'MF6010', 'MF0520','MF2020', 'MF2030', 'MF4520', 'MF1050', 'MF2520', 'MF1060'],
                            'KEY_COUNT': [2.01,0,0,0,0.45,0,0.01,0,0.3,0.9,0,0.05,1.3,0.98]})

The way I have considered solving this is to create a dataframe for each KEY/KEY_COUNT pair, concatenate them, and then do some kind of lookup to populate the values in the second dataframe. There must be a better solution!

Upvotes: 1

Views: 57

Answers (2)

e-motta

Reputation: 7520

Assuming your columns always end with numbers 1...3, you can use this:

n = range(1, 4)

# Reshape to long format: one row per (KEY_1..3, KEY_COUNT) combination
df = pd.melt(
    df,
    id_vars=[f"KEY_{i}" for i in n],
    value_vars=[f"KEY_COUNT{i}" for i in n],
    var_name="KEY_TYPE",
    value_name="KEY_COUNT",
)

# For each i, keep the rows that came from KEY_COUNTi, pair them with KEY_i,
# then stack the three slices and sum the counts per key
df = (
    pd.concat(
        [
            df[df["KEY_TYPE"].str.endswith(str(i))]
            .filter([f"KEY_{i}", "KEY_COUNT"])
            .rename(columns={f"KEY_{i}": "KEY"})
            for i in n
        ]
    )
    .groupby("KEY")
    .sum()
)

df1["KEY_COUNT"] = df1["KEY"].map(df["KEY_COUNT"]).fillna(0)

       KEY  KEY_COUNT
0   MF1020       2.01
1   MF0565       0.00
2   MF1540       0.00
3   MF5520       0.00
4   MF1525       0.45
5   MF6515       0.00
6   MF6010       0.01
7   MF0520       0.00
8   MF2020       0.30
9   MF2030       0.90
10  MF4520       0.00
11  MF1050       0.05
12  MF2520       1.30
13  MF1060       0.98

Upvotes: 1

Oluwafemi Sule

Reputation: 38952

You can use the DataFrame.melt method to transform the dataframe to a long format.

melted_df = df.melt(
                    id_vars=[f'KEY_{i}' for i in range(1, 4)],
                    var_name='KEY_COUNT',
                    value_name='COUNT')

Then extract the index number from the KEY_COUNT column.

melted_df['KEY_INDEX'] = melted_df['KEY_COUNT'].str.extract(r'(\d+)').astype(int)

Look up the corresponding KEY_* column.

melted_df['KEY_INDEX'] = melted_df.apply(
    lambda row: row[f'KEY_{row["KEY_INDEX"]}'], axis=1)

Group the dataframe and sum the values.

final_df = melted_df[['KEY_INDEX', 'COUNT']].groupby('KEY_INDEX', as_index=False).sum()

Upvotes: 1
