Li Shuwei
Li Shuwei

Reputation: 9

Return a column which meet two conditions with pandas dataframe

I am very new to python, and here I have a question I don't know how to fix, please help.

Here is the thing: I have a dataframe, and I want to extract a column which meets two different conditions.

The columns is as follows:

     state gender  year       name  births
13299      AK      F  2013       Emma      57
13300      AK      F  2013     Sophia      50
13301      AK      F  2013    Abigail      39
13302      AK      F  2013   Isabella      38
13303      AK      F  2013     Olivia      36
13304      AK      F  2013  Charlotte      34
13305      AK      F  2013     Harper      34
13306      AK      F  2013      Emily      33
13307      AK      F  2013        Ava      31
13308      AK      F  2013      Avery      30
5742631    WY      M  2013     Emmett       5
5742632    WY      M  2013      Jesse       5
5742633    WY      M  2013      Jonah       5
5742634    WY      M  2013       Jude       5
5742635    WY      M  2013      Kaden       5
5742636    WY      M  2013      Kaleb       5
5742637    WY      M  2013      Kasen       5
5742638    WY      M  2013     Kellan       5

There is like 90K rows in this dataframe, I want to return the value of 'name' where the 'gender' column is as evenly distributed to 'M' and 'F' as possible.

Or in other words: I want to return the value of 'name' under the condition that 'births' columns contains same number of 'M' and 'F'.

Sorry I am new to Python, and I got stuck on this for quite awhile. I was trying to split the dataframe into two different dataframe, and do it that way, but I found it was kind of impossible.

Any suggestion would be appreciated.

Upvotes: 0

Views: 362

Answers (2)

piRSquared
piRSquared

Reputation: 294516

I've defined df1 to further process. I've set the index to be ['name', 'gender'], then unstack to get 'gender' into columns. .births to focus on births. Then I divide the min by the max to avoid dividing by zero.

df1 = df.set_index(['name', 'gender'], append=True).unstack().births.fillna(0)
df1.min(1).astype(float).div(df1.max(1)).sort_values(ascending=False)

This should give you a sorted dataframe by which name has a the closest ratio to 1.

Upvotes: 0

breucopter
breucopter

Reputation: 321

Pivot table in pandas works fine here:

pvt = pd.pivot_table(df,values='births',columns='gender',index='name',aggfunc='sum')
pvt[pvt['M'] == pvt['F']]

This returns a dataframe with name as an index and M,F for columns. It's unlikely that unisex names will be exactly equal though so you can instead do a multiconditional like

pvt[(pvt['M'] + 10 > pvt['F']) & (pvt['M'] - 10 < pvt['F'])]

Upvotes: 1

Related Questions