Reputation: 5373
Let us say that I have a data frame where I want to associate users with countries:
>>> dfUsers[['userId', 'country', 'lat']].dropna().groupby(['userId', 'country']).agg(len).reset_index()
userId country lat
0 1479705782818706665 India 1
1 1480576924651623757 India 12
2 1480576924651623757 РФ 2
3 1480928137574356334 Malaysia 17
4 1480988896538924406 India 1
5 1481723517601846740 Malaysia 2
6 1481810347655435765 Singapore 3
7 1481818704328005112 Singapore 6
8 1482457537889441352 Singapore 18
9 1482488858703566411 Singapore 1
10 1482730123382756957 India 1
11 1483106342385227382 Singapore 2
12 1483316566673069712 Malaysia 4
13 1484507758001657608 Singapore 6
14 1484654275131873053 Singapore 1
15 1484666213119301417 Singapore 1
16 1484734631705057076 Malaysia 4
What I want to do is associate each user with a single country. In this case, it is easy to see that the user 1480576924651623757 has two different countries associated with him/her. However, I want to associate this user with India, because the user has been in India more often than in the other country ...
Is there a neat way of doing this? I can always loop over 'userId' and, for each user, pick the country with the largest count. However, I am wondering if there is a way of doing this without the loop ...
Upvotes: 0
Views: 760
Reputation: 862511
It seems you need idxmax to find, for each userId group, the index of the row with the maximum value in column lat, and then select those rows with loc:
df = df.loc[df.groupby('userId')['lat'].idxmax()]
print (df)
userId country lat
0 1479705782818706665 India 1
1 1480576924651623757 India 12 < 12 is max, so India
3 1480928137574356334 Malaysia 17
4 1480988896538924406 India 1
5 1481723517601846740 Malaysia 2
6 1481810347655435765 Singapore 3
7 1481818704328005112 Singapore 6
8 1482457537889441352 Singapore 18
9 1482488858703566411 Singapore 1
10 1482730123382756957 India 1
11 1483106342385227382 Singapore 2
12 1483316566673069712 Malaysia 4
13 1484507758001657608 Singapore 6
14 1484654275131873053 Singapore 1
15 1484666213119301417 Singapore 1
16 1484734631705057076 Malaysia 4
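For anyone who wants to try this in isolation, here is a minimal sketch of the idxmax step. The frame `counts` and the names `best_rows` and `top_country` are made up for illustration; the rows are a small subset of the aggregated output from the question.

```python
import pandas as pd

# Hypothetical subset of the aggregated counts from the question
# (the 'lat' column holds the number of observations per user/country).
counts = pd.DataFrame({
    "userId": [1479705782818706665, 1480576924651623757,
               1480576924651623757, 1480928137574356334],
    "country": ["India", "India", "РФ", "Malaysia"],
    "lat": [1, 12, 2, 17],
})

# For each user, the index label of the row with the largest count.
best_rows = counts.groupby("userId")["lat"].idxmax()

# Selecting those labels leaves one row per user.
top_country = counts.loc[best_rows]
print(top_country)
```

Note that if a user has two countries with the same count, idxmax returns the first occurrence, so the result then depends on row order.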
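If you start from the original dfUsers, you can build the counts with size and then apply the same idea (the chained calls need to be wrapped in parentheses to span several lines):
# count rows per (userId, country) pair, ignoring rows with missing values
df = (dfUsers[['userId', 'country', 'lat']].dropna()
      .groupby(['userId', 'country'])
      .size()
      .reset_index(name='Count'))
# keep, for each user, the row with the highest count
df = df.loc[df.groupby('userId')['Count'].idxmax()]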
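A rough end-to-end sketch of the same approach, starting from raw rows rather than pre-aggregated counts. The frame `raw`, its values, and the final `user_to_country` dictionary are assumptions made up for this example; they are not part of the original data.

```python
import pandas as pd

# Hypothetical raw observations: one row per sighting of a user.
# A missing 'lat' marks a row that dropna() should discard.
raw = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 3],
    "country": ["India", "India", "РФ", "Singapore", "Singapore", "Malaysia"],
    "lat": [12.9, 13.0, 55.7, 1.3, None, 3.1],
})

# Count the remaining rows per (userId, country) pair.
counts = (
    raw[["userId", "country", "lat"]]
    .dropna()
    .groupby(["userId", "country"])
    .size()
    .reset_index(name="Count")
)

# Keep the most frequent country per user, then turn it into a lookup.
top = counts.loc[counts.groupby("userId")["Count"].idxmax()]
user_to_country = top.set_index("userId")["country"].to_dict()
print(user_to_country)
```

The dictionary at the end is just one convenient shape for the result; keeping `top` as a DataFrame works equally well if you want to merge it back onto the original rows.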
Upvotes: 1