Reputation: 21552
I found some issues in this question, so I'd like to go a bit further. I have the following dataframe df:
H,Nu,City,L,C
0.965392,15,Madrid,es,es
0.920614,15,Madrid,it,es
0.726219,16,Madrid,tn,es
0.739119,17,Madrid,fr,es
0.789923,55,Dublin,mt,en
0.699239,57,Dublin,en,en
0.890462,68,Dublin,ar,en
0.746863,68,Dublin,pt,en
0.789923,55,Milano,it,it
0.699239,57,Milano,es,it
0.890462,68,Milano,ar,it
0.746863,68,Milano,pt,it
I would like to add a column called Hm, defined as:
Hm = H / H_city, where H_city is the H value of the row in the same City for which L == C
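For instance, among the Madrid rows the reference is the one with L == C ('es', H = 0.965392), so the second Madrid row would get Hm = 0.920614 / 0.965392 ≈ 0.9536.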
So far I found the correct value / city with:
gp = df.groupby('City')
# per-city boolean mask marking the rows where L == C
mask = gp.apply(lambda x: x['L'] == x['C'])
# pull the matching reference rows back out of df
lookup = df.loc[mask[mask].reset_index(level=0).index]
obtaining:
          H  Nu    City   L   C
5  0.699239  57  Dublin  en  en
0  0.965392  15  Madrid  es  es
8  0.789923  55  Milano  it  it
These H values are the correct normalization values per city. How can I now add a new column Hm so that each row of the dataframe is scaled by the corresponding lookup value for its city? For instance:
H,Nu,City,L,C,Hm
0.965392,15,Madrid,es,es,1.0
0.920614,15,Madrid,it,es,**
0.726219,16,Madrid,tn,es,**
0.739119,17,Madrid,fr,es,**
0.789923,55,Dublin,mt,en,**
0.699239,57,Dublin,en,en,1.0
0.890462,68,Dublin,ar,en,**
0.746863,68,Dublin,pt,en,**
0.789923,55,Milano,it,it,1.0
0.699239,57,Milano,es,it,**
0.890462,68,Milano,ar,it,**
0.746863,68,Milano,pt,it,**
I would like to avoid a merge, because it leads me to incorrect behaviour.
EDIT:
To clarify: it is enough to add a new column containing, for each row, the reference H value of its city (HmCity below), as:
H,Nu,City,L,C,HmCity
0.965392,15,Madrid,es,es,0.965392
0.920614,15,Madrid,it,es,0.965392
0.726219,16,Madrid,tn,es,0.965392
0.739119,17,Madrid,fr,es,0.965392
0.789923,55,Dublin,mt,en,0.699239
0.699239,57,Dublin,en,en,0.699239
0.890462,68,Dublin,ar,en,0.699239
0.746863,68,Dublin,pt,en,0.699239
0.789923,55,Milano,it,it,0.789923
0.699239,57,Milano,es,it,0.789923
0.890462,68,Milano,ar,it,0.789923
0.746863,68,Milano,pt,it,0.789923
Upvotes: 0
Views: 80
Reputation: 571
Based on your revised question, the following works for your particular dataset:
import pandas as pd

df = pd.DataFrame(
    data=[[0.965392, 15, "Madrid", "es", "es"],
          [0.920614, 15, "Madrid", "it", "es"],
          [0.726219, 16, "Madrid", "tn", "es"],
          [0.739119, 17, "Madrid", "fr", "es"],
          [0.789923, 55, "Dublin", "mt", "en"],
          [0.699239, 57, "Dublin", "en", "en"],
          [0.890462, 68, "Dublin", "ar", "en"],
          [0.746863, 68, "Dublin", "pt", "en"],
          [0.789923, 55, "Milano", "it", "it"],
          [0.699239, 57, "Milano", "es", "it"],
          [0.890462, 68, "Milano", "ar", "it"],
          [0.746863, 68, "Milano", "pt", "it"]],
    columns=["H", "Nu", "City", "L", "C"])

def func(x):
    # Within each city group, take the H value of the row where L == C
    # and broadcast it to every row of the group.
    x['Hm'] = x.loc[x['L'] == x['C'], "H"].values[0]
    return x

print(df.groupby(["City"]).apply(func))
That gives the following output.
H Nu City L C Hm
0 0.965392 15 Madrid es es 0.965392
1 0.920614 15 Madrid it es 0.965392
2 0.726219 16 Madrid tn es 0.965392
3 0.739119 17 Madrid fr es 0.965392
4 0.789923 55 Dublin mt en 0.699239
5 0.699239 57 Dublin en en 0.699239
6 0.890462 68 Dublin ar en 0.699239
7 0.746863 68 Dublin pt en 0.699239
8 0.789923 55 Milano it it 0.789923
9 0.699239 57 Milano es it 0.789923
10 0.890462 68 Milano ar it 0.789923
11 0.746863 68 Milano pt it 0.789923
There may be a nicer way to do this though. I'll update the answer if I can come up with one.
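One merge-free candidate, a sketch assuming (as in your data) exactly one row per city has L == C:
ref = df.loc[df['L'] == df['C']].set_index('City')['H']  # City -> reference H
df['Hm'] = df['City'].map(ref)  # broadcast each city's reference onto its rows
Since map only looks the values up by City, it sidesteps the merge entirely; the normalization from your original definition would then be df['H'] / df['Hm'].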
Upvotes: 1