Reputation: 45
I have a dataframe of ids, states, places, types and populations:
id | state | place | type | population |
---|---|---|---|---|
A | ontario | toronto | Place | 5429524.0 |
A | ontario | ottawa | Capital | 989567.0 |
B | saskatchewan | saskatoon | Place | 246376.0 |
B | saskatchewan | regina | Place | 215106.0 |
B | saskatchewan | moose jaw | Place | 33890.0 |
I groupby to get all states with the same ID:
grouped_df = locations_df.groupby(["id", "state"])
I then need to either get the row that has ['type']== 'Capital' OR if that doesn't exist in that group, then choose the row with the highest population. e.g.
id | state | place | type | population |
---|---|---|---|---|
A | ontario | ottawa | Capital | 989567.0 |
B | saskatchewan | saskatoon | Place | 246376.0 |
I can get these individually using:
locations_df.groupby(["id", "state"]).population.idxmax()
locations_df.groupby(["id", "state"]).filter(lambda x: (x['type']== 'Capital').any())
Is there a way to combine this into one call?
Upvotes: 1
Views: 59
Reputation:
Here's an alternate solution, just for fun:
filtered = df.loc[(df.groupby(['id', 'state'])['population'].transform('idxmax').eq(df.index) + df['type'].eq('Capital').mul(2)).groupby([df['id'], df['state']]).idxmax()]
Output:
>>> filtered
id state place type population
1 A ontario ottawa Capital 989567.0
2 B saskatchewan saskatoon Place 246376.0
Upvotes: 1
Reputation: 120459
Sort values by type
and population
:
type
by ascending order because 'C' is before 'P'population
by descending order to have highest values at topThen group by id
and state
columns and finally take the first value of each group:
out = df.sort_values(['type', 'population'], ascending=[True, False]) \
.groupby(['id', 'state']).first()
print(out)
# Output
id state place type population
1 A ontario ottawa Capital 989567.0
2 B saskatchewan saskatoon Place 246376.0
Upvotes: 2
Reputation: 261300
You could use the fact that there are only two types and sort_values
(ascending for "type" to have "Capital" first and descending for "population" to have the highest first):
(df.sort_values(by=['type', 'population'], ascending=[True, False])
.groupby('id').first()
)
output:
state place type population
id
A ontario ottawa Capital 989567.0
B saskatchewan saskatoon Place 246376.0
Upvotes: 1