Tomp
Tomp

Reputation: 45

Pandas GroupBy and then choose one result using if or max

I have a dataframe of ids, states, places, types and populations:

id state place type population
A ontario toronto Place 5429524.0
A ontario ottawa Capital 989567.0
B saskatchewan saskatoon Place 246376.0
B saskatchewan regina Place 215106.0
B saskatchewan moose jaw Place 33890.0

I groupby to get all states with the same ID:

grouped_df = locations_df.groupby(["id", "state"])

I then need to either get the row that has ['type']== 'Capital' OR if that doesn't exist in that group, then choose the row with the highest population. e.g.

id state place type population
A ontario ottawa Capital 989567.0
B saskatchewan saskatoon Place 246376.0

I can get these individually using:

locations_df.groupby(["id", "state"]).population.idxmax()

locations_df.groupby(["id", "state"]).filter(lambda x: (x['type']== 'Capital').any())

Is there a way to combine this into one call?

Upvotes: 1

Views: 59

Answers (3)

user17242583
user17242583

Reputation:

Here's an alternate solution, just for fun:

filtered = df.loc[(df.groupby(['id', 'state'])['population'].transform('idxmax').eq(df.index) + df['type'].eq('Capital').mul(2)).groupby([df['id'], df['state']]).idxmax()]

Output:

>>> filtered
  id         state      place     type  population
1  A       ontario     ottawa  Capital    989567.0
2  B  saskatchewan  saskatoon    Place    246376.0

Upvotes: 1

Corralien
Corralien

Reputation: 120459

Sort values by type and population:

  • type by ascending order because 'C' is before 'P'
  • population by descending order to have highest values at top

Then group by id and state columns and finally take the first value of each group:

out = df.sort_values(['type', 'population'], ascending=[True, False]) \
        .groupby(['id', 'state']).first()
print(out)

# Output
  id         state      place     type  population
1  A       ontario     ottawa  Capital    989567.0
2  B  saskatchewan  saskatoon    Place    246376.0

Upvotes: 2

mozway
mozway

Reputation: 261300

You could use the fact that there are only two types and sort_values (ascending for "type" to have "Capital" first and descending for "population" to have the highest first):

(df.sort_values(by=['type', 'population'], ascending=[True, False])
   .groupby('id').first()
)

output:

            state       place      type  population
id                                                 
A        ontario      ottawa   Capital     989567.0
B   saskatchewan   saskatoon     Place     246376.0

Upvotes: 1

Related Questions