Hjin
Hjin

Reputation: 322

How to group and get three most frequent value?

i want to group by id and get three most frequent city. For example i have original dataframe

  ID    City
    1    London
    1    London
    1    New York
    1    London
    1    New York
    1    Berlin
    2    Shanghai
    2    Shanghai

and result i want is like this:

ID first_frequent_city   second_frequent_city   third_frequent_city
1   London               New York               Berlin
2   Shanghai             NaN                    NaN

Upvotes: 2

Views: 1014

Answers (4)

Henry Yik
Henry Yik

Reputation: 22503

Another way using count as a reference to sort, then recreate dataframe by looping through groupby object:

df = (df.assign(count=df.groupby(["ID","City"])["City"].transform("count"))
        .drop_duplicates(["ID","City"])
        .sort_values(["ID","count"], ascending=False))
    
print (pd.DataFrame([i["City"].unique()[:3] for _, i in df.groupby("ID")]).fillna(np.NaN))

          0         1       2
0    London  New York  Berlin
1  Shanghai       NaN     NaN

Upvotes: 2

jezrael
jezrael

Reputation: 862671

First step is use SeriesGroupBy.value_counts for count values of City per ID, advantage is already values are sorted, then get counter by GroupBy.cumcount, filter first 3 values by loc, pivoting by DataFrame.pivot, change columns names and last convert ID to column by DataFrame.reset_index:

df = (df.groupby('ID')['City'].value_counts()
        .groupby(level=0).cumcount()
        .loc[lambda x: x < 3]
        .reset_index(name='c')
        .pivot('ID','c','City')
        .rename(columns={0:'first_', 1:'second_', 2:'third_'})
        .add_suffix('frequent_city')
        .rename_axis(None, axis=1)
        .reset_index())
print (df)
   ID first_frequent_city second_frequent_city third_frequent_city
0   1              London             New York              Berlin
1   2            Shanghai                  NaN                 NaN

Upvotes: 4

David Erickson
David Erickson

Reputation: 16683

Get the .count by ID and City and then use np.where() with .groupby() with max, median and min. Then set the index and unstack rows to columns on the max column.

df = df.assign(count=df.groupby(['ID', 'City'])['City'].transform('count')).drop_duplicates()
df['max'] = np.where((df['count'] == df.groupby('ID')['count'].transform('min')), 'third_frequent_city', np.nan)
df['max'] = np.where((df['count'] == df.groupby('ID')['count'].transform('median')), 'second_frequent_city', df['max'])
df['max'] = np.where((df['count'] == df.groupby('ID')['count'].transform('max')), 'first_frequent_city', df['max'])
df = df.drop('count',axis=1).set_index(['ID', 'max']).unstack(1)

output:

    City
max first_frequent_city second_frequent_city    third_frequent_city
ID          
1   London              New York                Berlin
2   Shanghai            NaN                     NaN

Upvotes: 1

sammywemmy
sammywemmy

Reputation: 28644

A bit long, essentially you groupby twice, first part works on the idea that grouping sorts the data in ascending order, the second part allows us to split the data into individual columns :

(df
.groupby("ID")
.tail(3)
.drop_duplicates()
.groupby("ID")
.agg(",".join)
.City.str.split(",", expand=True)
.set_axis(["first_frequent_city",
           "second_frequent_city", 
           third_frequent_city"],
           axis="columns",)
)


     first_frequent_city    second_frequent_city    third_frequent_city
ID          
1      London                 New York                Berlin
2      Shanghai               None                    None

Upvotes: 1

Related Questions