Matt
Matt

Reputation: 171

Selecting the top 50 % percentage names from the columns of a pandas dataframe

I have a pandas dataframe that looks like this. The rows and the columns have the same name.

name a  b  c  d  e  f  g 
 a   10 5  4  8  5  6  4
 b   5  10 6  5  4  3  3
 c   -  4  9  3  6  5  7
 d   6  9  8  6  6  8  2
 e   8  5  4  4  14 9  6
 f   3  3  -  4  5  14 7
 g   4  5  8  9  6  7  10

I can get the 5 number of largest values by passing df['column_name'].nlargest(n=5) but if I have to return 50 % of the largest in descending order, is there anything that is inbuilt in pandas of it I have to write a function for it, how can I get them? I am quite new to python. Please help me out.

UPDATE : So let's take column a into consideration and it has values like 10, 5,-,6,8,3 and 4. I have to sum all of them up and get the top 50% of them. so the total in this case is 36. 50% of these values would be 18. So from column a, I want to select 10 and 8 only. Similarly I want to go through all the other columns and select 50%.

Upvotes: 2

Views: 7700

Answers (3)

Prayson W. Daniel
Prayson W. Daniel

Reputation: 15606

Sorting is flexible :)

df.sort_values('column_name',ascending=False).head(int(df.shape[0]*.5))

Update: frac argument is available only on .sample(), not in .head or .tail. df.sample(frac=.5) does give 50% but head and tail expects only int. df.head(frac=.5) fails with TypeError: head() got an unexpected keyword argument 'frac'

Note: on int() vs round()

int(3.X) == 3 # True Where 0 >= X >=9 
round(3.45) == 3 # True
round(3.5) == 4 # True

So when doing .head(int/round ...) do think of what behaviour fits your need.

Updated: Requirements

So let's take column a into consideration and it has values like 10, 5,-,6,8,3 and 4. I have to sum all of them up and get the top 50% of them. so the total, in this case, is 36. 50% of these values would be 18. So from column a, I want to select 10 and 8 only. Similarly, I want to go through all the other columns and select 50%. -Matt

A silly hack would be to sort, find the cumulative sum, find the middle by dividing it with the sum total and then use that to select part of your sorted column. e.g.

import pandas as pd

data = pd.read_csv(
pd.compat.StringIO("""name a b c d e f g 
a 10 5 4 8 5 6 4
b 5 10 6 5 4 3 3
c - 4 9 3 6 5 7
d 6 9 8 6 6 8 2
e 8 5 4 4 14 9 6
f 3 3 - 4 5 14 7
g 4 5 8 9 6 7 10"""), 
sep=' ', index_col='name'
).dropna(axis=1).apply(
pd.to_numeric, errors='coerce', downcast='signed')

x = data[['a']].sort_values(by='a',ascending=False)[(data[['a']].sort_values(by='a',ascending=False).cumsum()
                                                 /data[['a']].sort_values(by='a',ascending=False).sum())<=.5].dropna()
print(x)

Outcome:enter image description here

Upvotes: 9

Ashok KS
Ashok KS

Reputation: 691

You could sort the data frame and display only 90% of the data

df.sort_values('column_name',ascending=False).head(round(0.9*len(df)))

Upvotes: 1

Conner
Conner

Reputation: 31110

data.csv

name,a,b,c,d,e,f,g
a,10,5,4,8,5,6,4
b,5,10,6,5,4,3,3
c,-,4,9,3,6,5,7
d,6,9,8,6,6,8,2
e,8,5,4,4,14,9,6
f,3,3,-,4,5,14,7
g,4,5,8,9,6,7,10

test.py

#!/bin/python

import pandas as pd

def percentageOfList(l, p):
    return l[0:int(len(l) * p)]

df = pd.read_csv('data.csv')
print(percentageOfList(df.sort_values('b', ascending=False)['b'], 0.9))

Upvotes: 0

Related Questions