Reputation:
I have two pretty large DataFrame that are related as follows. The first DataFrame takes ID
as indices and specifies Min
and Max
. The second DataFrame has a bunch of scores and prices for each ID
.
For each ID
, I want to find the average of the prices that correspond to the scores that are between the max
and min scores
of that ID
.
An example:
dt = [[1,3],[0,2],[1,3],[0,3],[1,4]]
df = pd.DataFrame(dt, columns = ['Min score','Max score'],index=[1,2,3,4,5])
dt_2 = [[1,1,10],[1,4,12],[1,3,20],[1,1,3],[2,3,2],[2,1,4],[3,1,7],[3,2,9],[3,2,3],[4,4,4],[4,2,3],[4,1,2],[5,1,1],[5,3,5],[5,5,5],[5,0,10]]
df_2 = pd.DataFrame(dt_2,columns = ['ID','score','price'])
The output I want is
ID Avg
1 11
2 4
3 6.33
4 2.5
5 3
For example, for ID=1
, min score=1
and max score=3
, so only prices 10, 20 and 3 are relevant. The average of these is 11.
Is there a built-in pandas function that deals with this? Any help is appreciated. Thank you.
Upvotes: 1
Views: 73
Reputation: 29732
Use pandas.DataFrame.groupby
with Series.between
:
new_df = df_2.groupby("ID").apply(lambda x: x["price"][x["score"].between(*df.loc[x.name])].mean())
print(new_df)
Output:
ID
1 11.000000
2 4.000000
3 6.333333
4 2.500000
5 3.000000
dtype: float64
Added:
x.name
: value of the group key of groupby object, i.e. ID
in df_2
which is used as a look up value for df
df.loc[x.name]
: looking up Min Score
and Max Score
value based on the ID
from df_2
*df.loc[x.name]
: unpacks two values (min score and max score) into pandas.Series.between
to fill up left
and right
parameters.Upvotes: 2
Reputation: 16673
Just merge on ID
filter for the appropriate rows with .loc
and use .groupby
for the mean.
dt = [[1,3],[0,2],[1,3],[0,3],[1,4]]
df = pd.DataFrame(dt, columns = ['Min score','Max score'],index=[1,2,3,4,5])
df = df.reset_index().rename({'index' : 'ID'}, axis=1)
dt_2 = [[1,1,10],[1,4,12],[1,3,20],[1,1,3],[2,3,2],[2,1,4],[3,1,7],[3,2,9],[3,2,3],[4,4,4],[4,2,3],[4,1,2],[5,1,1],[5,3,5],[5,5,5],[5,0,10]]
df_2 = pd.DataFrame(dt_2,columns = ['ID','score','price'])
df_new = pd.merge(df_2, df, how='left', on='ID')
df_final = df_new.loc[(df_new['Min score'] <= df_new['score']) & (df_new['score'] <= df_new['Max score'])]
df_final = df_final.groupby('ID')['price'].mean().reset_index().rename({'price' : 'Avg'}, axis=1)
df_final
Upvotes: 0
Reputation: 102
I would make the dataframes ready for merging by resetting the index and renaming it to the 'ID' column and then merging the two dataframes naming it new_df:
new_df = df.reset_index().rename(columns={'index':'ID'}).merge(df_2,on='ID')
Then I would do the following .loc to make sure you have the min_score and max score only:
min_max_df = new_df.loc[(new_df['score'] >= new_df['Min score'])&(new_df['score']<=new_df['Max score'])]
And then I would use the groupby() pandas method with .mean() to get the average:
min_max_df.groupby('ID').mean()
Which would lead to the result:
Min score Max score score price
ID
1 1.0 3.0 1.666667 11.000000
2 0.0 2.0 1.000000 4.000000
3 1.0 3.0 1.666667 6.333333
4 0.0 3.0 1.500000 2.500000
5 1.0 4.0 2.000000 3.000000
Upvotes: 0
Reputation: 3926
Here you go:
df_3 = pd.merge(df_2, df, left_on="ID", right_index=True, how="left")
df_3[
(df_3.score >= df_3.loc[:, "Min score"]) & (df_3.score <= df_3.loc[:, "Max score"])
].groupby("ID").price.mean()
Upvotes: 1