Reputation: 9
I am very new to python, and here I have a question I don't know how to fix, please help.
Here is the thing: I have a dataframe, and I want to extract a column which meets two different conditions.
The columns is as follows:
state gender year name births
13299 AK F 2013 Emma 57
13300 AK F 2013 Sophia 50
13301 AK F 2013 Abigail 39
13302 AK F 2013 Isabella 38
13303 AK F 2013 Olivia 36
13304 AK F 2013 Charlotte 34
13305 AK F 2013 Harper 34
13306 AK F 2013 Emily 33
13307 AK F 2013 Ava 31
13308 AK F 2013 Avery 30
5742631 WY M 2013 Emmett 5
5742632 WY M 2013 Jesse 5
5742633 WY M 2013 Jonah 5
5742634 WY M 2013 Jude 5
5742635 WY M 2013 Kaden 5
5742636 WY M 2013 Kaleb 5
5742637 WY M 2013 Kasen 5
5742638 WY M 2013 Kellan 5
There is like 90K rows in this dataframe, I want to return the value of 'name' where the 'gender' column is as evenly distributed to 'M' and 'F' as possible.
Or in other words: I want to return the value of 'name' under the condition that 'births' columns contains same number of 'M' and 'F'.
Sorry I am new to Python, and I got stuck on this for quite awhile. I was trying to split the dataframe into two different dataframe, and do it that way, but I found it was kind of impossible.
Any suggestion would be appreciated.
Upvotes: 0
Views: 362
Reputation: 294516
I've defined df1
to further process. I've set the index to be ['name', 'gender']
, then unstack
to get 'gender' into columns. .births
to focus on births. Then I divide the min by the max to avoid dividing by zero.
df1 = df.set_index(['name', 'gender'], append=True).unstack().births.fillna(0)
df1.min(1).astype(float).div(df1.max(1)).sort_values(ascending=False)
This should give you a sorted dataframe by which name has a the closest ratio to 1.
Upvotes: 0
Reputation: 321
Pivot table in pandas works fine here:
pvt = pd.pivot_table(df,values='births',columns='gender',index='name',aggfunc='sum')
pvt[pvt['M'] == pvt['F']]
This returns a dataframe with name as an index and M,F for columns. It's unlikely that unisex names will be exactly equal though so you can instead do a multiconditional like
pvt[(pvt['M'] + 10 > pvt['F']) & (pvt['M'] - 10 < pvt['F'])]
Upvotes: 1