Reputation: 1654

Get first and second highest values in pandas columns

I am using pandas to analyse some election results. I have a DF, Results, which has a row for each constituency and columns representing the votes for the various parties (over 100 of them):

In[60]: Results.columns
Out[60]: 
Index(['Constituency', 'Region', 'Country', 'ID', 'Type', 'Electorate',
       'Total', 'Unnamed: 9', '30-50', 'Above',
       ...
       'WP', 'WRP', 'WVPTFP', 'Yorks', 'Young', 'Zeb', 'Party', 'Votes',
       'Share', 'Turnout'],
      dtype='object', length=147)

So...

In[63]: Results.head()
Out[63]: 
                         Constituency    Region   Country         ID    Type  \
PAID                                                                           
1                            Aberavon     Wales     Wales  W07000049  County   
2                           Aberconwy     Wales     Wales  W07000058  County   
3                      Aberdeen North  Scotland  Scotland  S14000001   Burgh   
4                      Aberdeen South  Scotland  Scotland  S14000002   Burgh   
5     Aberdeenshire West & Kincardine  Scotland  Scotland  S14000058  County   

      Electorate  Total  Unnamed: 9  30-50  Above    ...     WP  WRP  WVPTFP  \
PAID                                                 ...                       
1          49821  31523         NaN    NaN    NaN    ...    NaN  NaN     NaN   
2          45525  30148         NaN    NaN    NaN    ...    NaN  NaN     NaN   
3          67745  43936         NaN    NaN    NaN    ...    NaN  NaN     NaN   
4          68056  48551         NaN    NaN    NaN    ...    NaN  NaN     NaN   
5          73445  55196         NaN    NaN    NaN    ...    NaN  NaN     NaN   

      Yorks  Young  Zeb  Party  Votes     Share   Turnout  
PAID                                                       
1       NaN    NaN  NaN    Lab  15416  0.489040  0.632725  
2       NaN    NaN  NaN    Con  12513  0.415052  0.662230  
3       NaN    NaN  NaN    SNP  24793  0.564298  0.648550  
4       NaN    NaN  NaN    SNP  20221  0.416490  0.713398  
5       NaN    NaN  NaN    SNP  22949  0.415773  0.751528  

[5 rows x 147 columns]

The per-constituency results for each party are given in the columns Results.ix[:, 'Unnamed: 9': 'Zeb']

I can find the winning party (i.e. the party which polled highest number of votes) and the number of votes it polled using:

RawResults = Results.ix[:, 'Unnamed: 9': 'Zeb']
Results['Party'] = RawResults.idxmax(axis=1)
Results['Votes'] = RawResults.max(axis=1).astype(int)

But, I also need to know how many votes the second-place party got (and ideally its index/name). So is there any way in pandas to return the second highest value/index in a set of columns for each row?

Upvotes: 38

Answers (8)

Pedro Lobito

Reputation: 98871

To get the highest values of a column you can try nlargest() :

df['High'].nlargest(2)

The above will give you the 2 highest values of column High.

You can also use nsmallest() to get the lowest values.

Upvotes: 88

Feng Bao

Reputation: 91

Since the sort is applied to every column, the index labels will be dropped anyway. We can use numpy to achieve this:

a=np.random.random([10,4])
a
Out[10]: 
array([[0.11965879, 0.73168303, 0.18521419, 0.13992709],
       [0.8758143 , 0.13136781, 0.22071495, 0.9369399 ],
       [0.9190763 , 0.14320333, 0.66614619, 0.60688266],
       [0.91973194, 0.25297882, 0.02097999, 0.93834247],
       [0.70570349, 0.92229061, 0.64739799, 0.21614292],
       [0.70386848, 0.58542997, 0.78042638, 0.65968369],
       [0.37626386, 0.32668468, 0.42756616, 0.00938118],
       [0.85126718, 0.76305889, 0.03047206, 0.4874788 ],
       [0.88951106, 0.65035676, 0.98210505, 0.35605393],
       [0.96369325, 0.79311159, 0.81995022, 0.10588205]])
np.sort(a, axis=0)

We have output:

array([[0.11965879, 0.13136781, 0.02097999, 0.00938118],
       [0.37626386, 0.14320333, 0.03047206, 0.10588205],
       [0.70386848, 0.25297882, 0.18521419, 0.13992709],
       [0.70570349, 0.32668468, 0.22071495, 0.21614292],
       [0.85126718, 0.58542997, 0.42756616, 0.35605393],
       [0.8758143 , 0.65035676, 0.64739799, 0.4874788 ],
       [0.88951106, 0.73168303, 0.66614619, 0.60688266],
       [0.9190763 , 0.76305889, 0.78042638, 0.65968369],
       [0.91973194, 0.79311159, 0.81995022, 0.9369399 ],
       [0.96369325, 0.92229061, 0.98210505, 0.93834247]])

Here each column is sorted and you can slice the matrix to get the desired values.

Upvotes: 0

Fernando

Reputation: 21

df
           a           b           c            d           e           f           g          h
0   1.334444    0.322029    0.302296    -0.841236   -0.360488   -0.860188   -0.157942   1.522082
1   2.056572    0.991643    0.160067    -0.066473   0.235132    0.533202    1.282371    -2.050731
2   0.955586    -0.966734   0.055210    -0.993924   -0.553841   0.173793    -0.534548   -1.796006
3   1.201001    1.067291    -0.562357   -0.794284   -0.554820   -0.011836   0.519928    0.514669
4   -0.243972   -0.048144   0.498007    0.862016    1.284717    -0.886455   -0.757603   0.541992
5   0.739435    -0.767399   1.574173    1.197063    -1.147961   -0.903858   0.011073    -1.404868
6   -1.258282   -0.049719   0.400063    0.611456    0.443289    -1.110945   1.352029    0.215460
7   0.029121    -0.771431   -0.285119   -0.018216   0.408425    -1.458476   -1.363583   0.155134
8   1.427226    -1.005345   0.208665    -0.674917   0.287929    -1.259707   0.220420    -1.087245
9   0.452589    0.214592    -1.875423   0.487496    2.411265    0.062324    -0.327891   0.256577

tranpose and use nlargest in a for loop to get the results order by each line:

df1=df.T
results=list()
for col in df1.columns: results.append(df1[col].nlargest(len(df.columns))

the results var is a list of pandas objects, where the first item on the list will be the df's first row sorted in descending order and so on. Since each item on the list is a pandas object, it carries df's column as index (it was transposed), so you will get the values and the df's columns name of each row sorted

results
[h    1.522082
 a    1.334444
 b    0.322029
 c    0.302296
 g   -0.157942
 e   -0.360488
 d   -0.841236
 f   -0.860188
 Name: 0, dtype: float64,
 a    2.056572
 g    1.282371
 b    0.991643
 f    0.533202
 e    0.235132
 c    0.160067
 d   -0.066473
 h   -2.050731
 Name: 1, dtype: float64,
....

Upvotes: 1

Felix Le

Reputation: 21

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30],
    'c': [25, 20, 5, 15, 10]
})
def second_largest(df):
    
    return (df.nlargest(2).min())




print(df.apply(second_largest))
a     4
b    40
c    20
dtype: int64

Upvotes: 2

Amit Ghosh

Reputation: 1606

Here is an interesting approach. What if we replace the maximum value with the minimum value and calculate. Although it is a quick hack and, not recommended!

first_highest_value_index = df.idxmax()
second_highest_value_index = df.replace(df.max(),df(min)).idxmax()

first_highest_value = df[first_highest_value_index]
second_highest_value = df[second_highest_value_index]

Upvotes: 3

Darsh Shukla

Reputation: 389

Here is a solution using nlargest function:

>>> df
    a   b   c
 0  4  20   2
 1  5  10   2
 2  3  40   5
 3  1  50  10
 4  2  30  15
>>> def give_largest(col,n):
...     largest = col.nlargest(n).reset_index(drop = True)
...     data = [x for x in largest]
...     index = [f'{i}_largest' for i in range(1,len(largest)+1)]
...     return pd.Series(data,index=index)
...
...
>>> def n_largest(df, axis, n):
...     '''
...     Function to return the n-largest value of each
...     column/row of the input DataFrame.
...     '''
...     return df.apply(give_largest, axis = axis, n = n)
...
>>> n_largest(df,axis = 1, n = 2)
   1_largest  2_largest
0         20          4
1         10          5
2         40          5
3         50         10
4         30         15
>>> n_largest(df,axis = 0, n = 2)
                  a           b           c     
1_largest         5          50           15
2_largest         4          40           10

Upvotes: 2

MaxU - stand with Ukraine

Reputation: 210832

Here is a NumPy solution:

In [120]: df
Out[120]:
          a         b         c         d         e         f         g         h
0  1.334444  0.322029  0.302296 -0.841236 -0.360488 -0.860188 -0.157942  1.522082
1  2.056572  0.991643  0.160067 -0.066473  0.235132  0.533202  1.282371 -2.050731
2  0.955586 -0.966734  0.055210 -0.993924 -0.553841  0.173793 -0.534548 -1.796006
3  1.201001  1.067291 -0.562357 -0.794284 -0.554820 -0.011836  0.519928  0.514669
4 -0.243972 -0.048144  0.498007  0.862016  1.284717 -0.886455 -0.757603  0.541992
5  0.739435 -0.767399  1.574173  1.197063 -1.147961 -0.903858  0.011073 -1.404868
6 -1.258282 -0.049719  0.400063  0.611456  0.443289 -1.110945  1.352029  0.215460
7  0.029121 -0.771431 -0.285119 -0.018216  0.408425 -1.458476 -1.363583  0.155134
8  1.427226 -1.005345  0.208665 -0.674917  0.287929 -1.259707  0.220420 -1.087245
9  0.452589  0.214592 -1.875423  0.487496  2.411265  0.062324 -0.327891  0.256577

In [121]: np.sort(df.values)[:,-2:]
Out[121]:
array([[ 1.33444404,  1.52208164],
       [ 1.28237078,  2.05657214],
       [ 0.17379254,  0.95558613],
       [ 1.06729107,  1.20100071],
       [ 0.86201603,  1.28471676],
       [ 1.19706331,  1.57417327],
       [ 0.61145573,  1.35202868],
       [ 0.15513379,  0.40842477],
       [ 0.28792928,  1.42722604],
       [ 0.48749578,  2.41126532]])

or as a pandas Data Frame:

In [122]: pd.DataFrame(np.sort(df.values)[:,-2:], columns=['2nd-largest','largest'])
Out[122]:
   2nd-largest   largest
0     1.334444  1.522082
1     1.282371  2.056572
2     0.173793  0.955586
3     1.067291  1.201001
4     0.862016  1.284717
5     1.197063  1.574173
6     0.611456  1.352029
7     0.155134  0.408425
8     0.287929  1.427226
9     0.487496  2.411265

or a faster solution from @Divakar:

In [6]: df
Out[6]:
          a         b         c         d         e         f         g         h
0  0.649517 -0.223116  0.264734 -1.121666  0.151591 -1.335756 -0.155459 -2.500680
1  0.172981  1.233523  0.220378  1.188080 -0.289469 -0.039150  1.476852  0.736908
2 -1.904024  0.109314  0.045741 -0.341214 -0.332267 -1.363889  0.177705 -0.892018
3 -2.606532 -0.483314  0.054624  0.979734  0.205173  0.350247 -1.088776  1.501327
4  1.627655 -1.261631  0.589899 -0.660119  0.742390 -1.088103  0.228557  0.714746
5  0.423972 -0.506975 -0.783718 -2.044002 -0.692734  0.980399  1.007460  0.161516
6 -0.777123 -0.838311 -1.116104 -0.433797  0.599724 -0.884832 -0.086431 -0.738298
7  1.131621  1.218199  0.645709  0.066216 -0.265023  0.606963 -0.194694  0.463576
8  0.421164  0.626731 -0.547738  0.989820 -1.383061 -0.060413 -1.342769 -0.777907
9 -1.152690  0.696714 -0.155727 -0.991975 -0.806530  1.454522  0.788688  0.409516

In [7]: a = df.values

In [8]: a[np.arange(len(df))[:,None],np.argpartition(-a,np.arange(2),axis=1)[:,:2]]
Out[8]:
array([[ 0.64951665,  0.26473378],
       [ 1.47685226,  1.23352348],
       [ 0.17770473,  0.10931398],
       [ 1.50132666,  0.97973383],
       [ 1.62765464,  0.74238959],
       [ 1.00745981,  0.98039898],
       [ 0.5997243 , -0.0864306 ],
       [ 1.21819904,  1.13162068],
       [ 0.98982033,  0.62673128],
       [ 1.45452173,  0.78868785]])

Upvotes: 22

Kartik

Reputation: 8683

You could just sort your results, such that the first rows will contain the max. Then you can simply use indexing to get the first n places.

RawResults = Results.ix[:, 'Unnamed: 9': 'Zeb'].sort_values(by='votes', ascending=False)
RawResults.iloc[0, :] # First place
RawResults.iloc[1, :] # Second place
RawResults.iloc[n, :] # nth place

Upvotes: 3

Get first and second highest values in pandas columns

Answers (8)

Related Questions