Reputation: 171
I have a pandas dataframe that looks like this. The rows and the columns have the same name.
name a b c d e f g
a 10 5 4 8 5 6 4
b 5 10 6 5 4 3 3
c - 4 9 3 6 5 7
d 6 9 8 6 6 8 2
e 8 5 4 4 14 9 6
f 3 3 - 4 5 14 7
g 4 5 8 9 6 7 10
I can get the 5 number of largest values by passing df['column_name'].nlargest(n=5) but if I have to return 50 % of the largest in descending order, is there anything that is inbuilt in pandas of it I have to write a function for it, how can I get them? I am quite new to python. Please help me out.
UPDATE : So let's take column a into consideration and it has values like 10, 5,-,6,8,3 and 4. I have to sum all of them up and get the top 50% of them. so the total in this case is 36. 50% of these values would be 18. So from column a, I want to select 10 and 8 only. Similarly I want to go through all the other columns and select 50%.
Upvotes: 2
Views: 7700
Reputation: 15606
Sorting is flexible :)
df.sort_values('column_name',ascending=False).head(int(df.shape[0]*.5))
Update: frac argument is available only on .sample(), not in .head or .tail. df.sample(frac=.5) does give 50% but head and tail expects only int. df.head(frac=.5) fails with TypeError: head() got an unexpected keyword argument 'frac'
Note: on int() vs round()
int(3.X) == 3 # True Where 0 >= X >=9
round(3.45) == 3 # True
round(3.5) == 4 # True
So when doing .head(int/round ...) do think of what behaviour fits your need.
Updated: Requirements
So let's take column a into consideration and it has values like 10, 5,-,6,8,3 and 4. I have to sum all of them up and get the top 50% of them. so the total, in this case, is 36. 50% of these values would be 18. So from column a, I want to select 10 and 8 only. Similarly, I want to go through all the other columns and select 50%. -Matt
A silly hack would be to sort, find the cumulative sum, find the middle by dividing it with the sum total and then use that to select part of your sorted column. e.g.
import pandas as pd
data = pd.read_csv(
pd.compat.StringIO("""name a b c d e f g
a 10 5 4 8 5 6 4
b 5 10 6 5 4 3 3
c - 4 9 3 6 5 7
d 6 9 8 6 6 8 2
e 8 5 4 4 14 9 6
f 3 3 - 4 5 14 7
g 4 5 8 9 6 7 10"""),
sep=' ', index_col='name'
).dropna(axis=1).apply(
pd.to_numeric, errors='coerce', downcast='signed')
x = data[['a']].sort_values(by='a',ascending=False)[(data[['a']].sort_values(by='a',ascending=False).cumsum()
/data[['a']].sort_values(by='a',ascending=False).sum())<=.5].dropna()
print(x)
Upvotes: 9
Reputation: 691
You could sort the data frame and display only 90% of the data
df.sort_values('column_name',ascending=False).head(round(0.9*len(df)))
Upvotes: 1
Reputation: 31110
data.csv
name,a,b,c,d,e,f,g
a,10,5,4,8,5,6,4
b,5,10,6,5,4,3,3
c,-,4,9,3,6,5,7
d,6,9,8,6,6,8,2
e,8,5,4,4,14,9,6
f,3,3,-,4,5,14,7
g,4,5,8,9,6,7,10
test.py
#!/bin/python
import pandas as pd
def percentageOfList(l, p):
return l[0:int(len(l) * p)]
df = pd.read_csv('data.csv')
print(percentageOfList(df.sort_values('b', ascending=False)['b'], 0.9))
Upvotes: 0