Gulzar
Gulzar

Reputation: 28044

How to assign a value to each group by a groupby query on another dataframe?

Let the following dfs:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "k1": [1, 1, 2, 2, 3, 3, 4, 4, 4],
})

df2 = pd.DataFrame({
    "k2": [1, 1, 2, 2, 3, 4, 4],
    "v2": np.random.rand(7)
})

print(df1)
print("_______")
print(df2)
print("_______")

out:

   k1
0   1
1   1
2   2
3   2
4   3
5   3
6   4
7   4
8   4
_______
   k2        v2
0   1  0.260026
1   1  0.474951
2   2  0.695962
3   2  0.158575
4   3  0.396015
5   4  0.740344
6   4  0.293410
_______

I want to create a new column for df1 such that for every key k1, a corresponding value will be applied such that if k1 == k2, the value will be a function (say max) of v2 of the group in df2 whose key is k2 (k1).

Required output for above case:

   k1  result
0   1  0.474951
1   1  0.474951
2   2  0.695962
3   2  0.695962
4   3  0.396015
5   3  0.396015
6   4  0.740344
7   4  0.740344
8   4  0.740344

It can be assumed that all keys present in k1 are also in k2.


This is probably done with two groupby operations, one for query and one for assignment, but I can't figure out how to tie together the output of one to the input of the other.


Edit:
Please notice the example k1 and k2 are sorted for clarity, but are not guaranteed to be. I also don't want to sort because of o(nlogn) time, and this can be done in o(n)

Upvotes: 1

Views: 76

Answers (2)

Ynjxsjmh
Ynjxsjmh

Reputation: 30050

First, you can sort on k2 and v2 columns in df2 to ensure that the bigger value in column v2 stay on first. Then drop duplicates on k2 to keep the first which is the max. At last, map v2 column in k2 to df1.

df1['result'] = df1['k1'].map(df2.sort_values(['k2', 'v2'], ascending=[True, False]).drop_duplicates('k2', keep='first').set_index('k2')['v2'])
print(df1)

   k1        result
0   1  0.303764
1   1  0.303764
2   2  0.026024
3   2  0.026024
4   3  0.213834
5   3  0.213834
6   4  0.757031
7   4  0.757031
8   4  0.757031

Upvotes: 1

Shubham Sharma
Shubham Sharma

Reputation: 71707

We can try map and groupby

df1['result'] = df1['k1'].map(df2.groupby('k2')['v2'].max())

   k1    result
0   1  0.474951
1   1  0.474951
2   2  0.695962
3   2  0.695962
4   3  0.396015
5   3  0.396015
6   4  0.740344
7   4  0.740344
8   4  0.740344

Upvotes: 3

Related Questions