Reputation: 1199
i have a dataframe as shown below. The elements in col_1
are connected to the elements in col_2
, giving results in output_1
. However, some elements in col_2
are now in col_1
as well. For example, while a-->b= 3
, b-->a= 24
.
col_1 col_2 output_1 average
a b 3 13.5 (because a-->b=3 and b-->a=24)
a c 5 3.5 (because a-->c=5 and c-->a=2)
a d 3
b a 24
b c 12
b d 5
c a 2
c b 3
c d 5
What i need is to calculate the average of these two values and of course all similar cases across the dataframe.
You can think of the data LIKE THIS: people in col_1
are calling people in col_2
. The output is the duration
. I want to calculate the average duration between every pair of people. col_1
and col_2
have string values while the third column "output" has the numeric values.
I have tried using pd.merge(df.col_1, df.col_2)
but didn't work.
Any suggestions will be appreciated.
Upvotes: 0
Views: 123
Reputation: 13407
Try this. Column col_12
you can either drop, or use further as a pair unique key (irrespective to elements order).
print(df)
df["col_12"]=df[["col_1", "col_2"]].apply(lambda x: str(sorted(x)), axis=1)
df2=df.groupby(df["col_12"]).agg({"output_1": "mean", "col_1": "count"}).rename(columns={"output_1": "output_1_mean", "col_1": "rows_count"})
df2.loc[df2["rows_count"]==1, "output_1_mean"]/=2
df2.drop("rows_count", axis=1, inplace=True)
df=df.join(df2, on="col_12")
print(df)
And output:
col_1 col_2 output_1
0 a b 3
1 a c 5
2 a d 3
3 b a 24
4 b c 12
5 b d 5
6 c a 2
7 c b 3
8 c d 5
col_1 col_2 output_1 col_12 output_1_mean
0 a b 3 ['a', 'b'] 13.5
1 a c 5 ['a', 'c'] 3.5
2 a d 3 ['a', 'd'] 1.5
3 b a 24 ['a', 'b'] 13.5
4 b c 12 ['b', 'c'] 7.5
5 b d 5 ['b', 'd'] 2.5
6 c a 2 ['a', 'c'] 3.5
7 c b 3 ['b', 'c'] 7.5
8 c d 5 ['c', 'd'] 2.5
[Program finished]
Upvotes: 1
Reputation: 274
EDITED
You may try
for ii in a['col_1'].unique():
p = pd.merge(a[a['col_1'] == ii], a[a['col_2'] == ii], left_on = 'col_2', right_on = 'col_1', left_index = True)
a.loc[p.index, 'mean'] = p.mean(axis = 1)
thanks to @baccandr for the correction
Upvotes: 1
Reputation: 1130
I would use numpy broadcasting:
i,j=np.where((df.col_1+df.col_2).values==(df.col_2+df.col_1).values[:,None])
average=0.5*(df.iloc[i].output_1.reset_index(drop=True)+\
df.iloc[j].output_1.reset_index(drop=True))
average.index=df.iloc[i].index
df['average']=average
The result I got is the following:
col_1 col_2 output_1 average
0 a b 3 13.5
1 a c 5 3.5
2 a d 3 NaN
3 b a 24 13.5
4 b c 12 7.5
5 b d 5 NaN
6 c a 2 3.5
7 c b 3 7.5
8 c d 5 NaN
Upvotes: 2
Reputation: 31
You can try .mean()
. click here for documentation.
Try this : df['average']=df[['col_1','col_2']].mean(axis=1)
Upvotes: 0