Reputation: 379
I have a seemingly simple problem that I can't solve:
I have a dataframe which looks like the following:
          DB          GOOG         ATVI         TSLA         TWTR
1.0   -0.0129755   -0.00463688  -0.00408223   0.0653678    0.0779629
2.0   -0.052772    -0.00712359  -0.0120323    0.0230537    0.0245435
3.0    0.00875274   0.00762426   0.00176186   0.0834672   -0.0017326
4.0   -0.0125196    0.00657628  -0.00235884   0.0502074    0.0157572
5.0   -0.0470443   -0.00382168  -0.0153009   -0.0325997   -0.0235564
6.0    0.0140261   -0.00630647  -0.00265291  -0.037598    -0.0454938
7.0    0.000624415 -0.00429897  -0.00088587  -9.73558e-05 -0.0216945
8.0   -0.0138933   -0.00455289  -0.027357    -0.00682128  -0.0186916
9.0   -0.00311624  -0.000168211 -0.0100577   -0.00894537  -0.00181214
10.0   0.0864933    0.0151531    0.05061      0.0294589    0.0395802
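For reproducibility, this frame can be rebuilt from the values above:
import pandas as pd

data = {'DB':   [-0.0129755, -0.052772, 0.00875274, -0.0125196, -0.0470443,
                 0.0140261, 0.000624415, -0.0138933, -0.00311624, 0.0864933],
        'GOOG': [-0.00463688, -0.00712359, 0.00762426, 0.00657628, -0.00382168,
                 -0.00630647, -0.00429897, -0.00455289, -0.000168211, 0.0151531],
        'ATVI': [-0.00408223, -0.0120323, 0.00176186, -0.00235884, -0.0153009,
                 -0.00265291, -0.00088587, -0.027357, -0.0100577, 0.05061],
        'TSLA': [0.0653678, 0.0230537, 0.0834672, 0.0502074, -0.0325997,
                 -0.037598, -9.73558e-05, -0.00682128, -0.00894537, 0.0294589],
        'TWTR': [0.0779629, 0.0245435, -0.0017326, 0.0157572, -0.0235564,
                 -0.0454938, -0.0216945, -0.0186916, -0.00181214, 0.0395802]}
df = pd.DataFrame(data,
                  index=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
                  columns=['DB','GOOG','ATVI','TSLA','TWTR'])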
For every row I want to find the column names corresponding to the largest, second largest and smallest values.
Example:
For the first row (index 1.0) I want to extract TWTR (largest value), TSLA (second largest value) and DB (smallest value).
...
For the fifth row (index 5.0) I want to extract GOOG (largest value), ATVI (second largest value) and DB (smallest value).
... and so on.
What is the simplest way to do this? Thanks!
Upvotes: 1
Views: 1195
Reputation: 863731
You can use the very fast numpy.argsort
to reorder the column names by the sorted values in each row:
import numpy as np

print (np.argsort(-df.values, axis=1))
[[4 3 2 1 0]
[4 3 1 2 0]
[3 0 1 2 4]
[3 4 1 2 0]
[1 2 4 3 0]
[0 2 1 3 4]
[0 3 2 1 4]
[1 3 0 4 2]
[1 4 0 3 2]
[0 2 4 3 1]]
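Negating the values flips the direction: argsort sorts ascending, so argsort(-a) gives the positions of a from largest to smallest. A quick check of the equivalence (valid for rows without ties):
a = np.array([0.0653678, 0.0779629, -0.0129755])
print (np.argsort(-a))       # [1 0 2] - descending order of a
print (np.argsort(a)[::-1])  # same result when the values are distinct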
print (df.columns[np.argsort(-df.values, axis=1)])
Index([['TWTR', 'TSLA', 'ATVI', 'GOOG', 'DB'],
['TWTR', 'TSLA', 'GOOG', 'ATVI', 'DB'],
['TSLA', 'DB', 'GOOG', 'ATVI', 'TWTR'],
['TSLA', 'TWTR', 'GOOG', 'ATVI', 'DB'],
['GOOG', 'ATVI', 'TWTR', 'TSLA', 'DB'],
['DB', 'ATVI', 'GOOG', 'TSLA', 'TWTR'],
['DB', 'TSLA', 'ATVI', 'GOOG', 'TWTR'],
['GOOG', 'TSLA', 'DB', 'TWTR', 'ATVI'],
['GOOG', 'TWTR', 'DB', 'TSLA', 'ATVI'],
['DB', 'ATVI', 'TWTR', 'TSLA', 'GOOG']],
dtype='object')
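# keep positions 0 (largest), 1 (second largest) and -1 (smallest) of each sorted row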
df = pd.DataFrame(df.columns[np.argsort(-df.values, axis=1)][:,[0,1,-1]], index=df.index)
df.columns = ['largest','second largest','smallest']
print (df)
largest second largest smallest
1.0 TWTR TSLA DB
2.0 TWTR TSLA DB
3.0 TSLA DB TWTR
4.0 TSLA TWTR DB
5.0 GOOG ATVI DB
6.0 DB ATVI TWTR
7.0 DB TSLA TWTR
8.0 GOOG TSLA ATVI
9.0 GOOG TWTR ATVI
10.0 DB ATVI GOOG
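If the frame gets wide, a full per-row sort does more work than needed. Here is a sketch of a numpy.argpartition variation (an extra aside, not among the timed solutions below; it assumes the original df):
vals = df.values
# passing both 0 and 1 as kth guarantees those two positions come out in sorted order,
# so column 0 holds the largest and column 1 the second largest per row
top2 = np.argpartition(-vals, (0, 1), axis=1)[:, :2]
res = pd.DataFrame({'largest': df.columns[top2[:, 0]],
                    'second largest': df.columns[top2[:, 1]],
                    'smallest': df.columns[vals.argmin(axis=1)]},
                   index=df.index,
                   columns=['largest','second largest','smallest'])
print (res)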
Other, slower solutions use apply
with a custom function that sorts each row and picks out the index labels:
def f(x):
    x = x.sort_values()
    return pd.Series([x.index[-1], x.index[-2], x.index[0]],
                     index=['largest','second largest','smallest'])
df = df.apply(f, axis=1)
print (df)
largest second largest smallest
1.0 TWTR TSLA DB
2.0 TWTR TSLA DB
3.0 TSLA DB TWTR
4.0 TSLA TWTR DB
5.0 GOOG ATVI DB
6.0 DB ATVI TWTR
7.0 DB TSLA TWTR
8.0 GOOG TSLA ATVI
9.0 GOOG TWTR ATVI
10.0 DB ATVI GOOG
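# sort each row, keep the full label order, then slice positions -1 (largest), -2 and 0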
df = df.apply(lambda x: x.sort_values().index, axis=1)
df = df.iloc[:, [-1,-2,0]]
df.columns = ['largest','second largest','smallest']
print (df)
largest second largest smallest
1.0 TWTR TSLA DB
2.0 TWTR TSLA DB
3.0 TSLA DB TWTR
4.0 TSLA TWTR DB
5.0 GOOG ATVI DB
6.0 DB ATVI TWTR
7.0 DB TSLA TWTR
8.0 GOOG TSLA ATVI
9.0 GOOG TWTR ATVI
10.0 DB ATVI GOOG
Timings (the vectorised argsort solution avoids a Python-level loop over the rows, hence the large gap):
#[10000 rows x 5 columns]
df = pd.concat([df]*1000).reset_index(drop=True)
In [357]: %timeit pd.DataFrame(df.columns[np.argsort(-df.values, axis=1)][:,[0,1,-1]], index=df.index, columns=['largest','second largest','smallest'])
1000 loops, best of 3: 974 µs per loop
In [358]: %timeit df.apply(f ,axis=1)
1 loop, best of 3: 3.91 s per loop
In [361]: %timeit df.apply(lambda x: x.sort_values().index ,axis=1).iloc[:, [-1,-2,0]]
1 loop, best of 3: 1.88 s per loop
In [362]: %timeit df.apply(lambda x: x.sort_values().index.to_series().iloc[0], axis=1)
1 loop, best of 3: 2.47 s per loop
In [363]: %timeit df.apply(lambda x: x.sort_values(ascending=False).index.to_series().iloc[0], axis=1)
1 loop, best of 3: 2.51 s per loop
In [364]: %timeit df.apply(lambda x: x.sort_values(ascending=False).index.to_series().iloc[1], axis=1)
1 loop, best of 3: 2.52 s per loop
In [365]: %timeit [df.T.sort_values(by=k).T.keys()[-1] for k in df.T.keys()]
1 loop, best of 3: 6.42 s per loop
Upvotes: 2
Reputation: 781
Not sure if it is the simplest solution, but it works:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5,5), columns=['a','b','c','d','e'])
largest = [df.T.sort_values(by=k).T.keys()[-1] for k in df.T.keys()]
largest2 = [df.T.sort_values(by=k).T.keys()[-2] for k in df.T.keys()]
smallest = [df.T.sort_values(by=k).T.keys()[0] for k in df.T.keys()]
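These three lists line up with the rows of df, so they can be glued into one frame (a small sketch on top of the answer):
summary = pd.DataFrame({'largest': largest,
                        'second largest': largest2,
                        'smallest': smallest},
                       index=df.index,
                       columns=['largest','second largest','smallest'])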
Upvotes: 1
Reputation: 6794
You can use apply
along the row axis and then sort_values
to order the data. Then you use integer-based indexing on the sorted series' index to get what you want:
# smallest value
df.apply(lambda x: x.sort_values().index[0], axis=1)
# biggest value
df.apply(lambda x: x.sort_values(ascending=False).index[0], axis=1)
# 2nd biggest value
df.apply(lambda x: x.sort_values(ascending=False).index[1], axis=1)
Applied to the example data, the output looks like:
In [38]: df.apply(lambda x: x.sort_values().index.to_series().iloc[0], axis=1)
Out[38]:
1.0 DB
2.0 DB
3.0 TWTR
4.0 DB
5.0 DB
6.0 TWTR
7.0 TWTR
8.0 ATVI
9.0 ATVI
10.0 GOOG
dtype: object
In [39]: df.apply(lambda x: x.sort_values(ascending=False).index.to_series().iloc[0], axis=1)
Out[39]:
1.0 TWTR
2.0 TWTR
3.0 TSLA
4.0 TSLA
5.0 GOOG
6.0 DB
7.0 DB
8.0 GOOG
9.0 GOOG
10.0 DB
dtype: object
In [40]: df.apply(lambda x: x.sort_values(ascending=False).index.to_series().iloc[1], axis=1)
Out[40]:
1.0 TSLA
2.0 TSLA
3.0 DB
4.0 TWTR
5.0 ATVI
6.0 ATVI
7.0 TSLA
8.0 TSLA
9.0 TWTR
10.0 ATVI
dtype: object
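As an aside (not part of the answer above): if only the largest and smallest labels are needed, the built-in idxmax/idxmin avoid sorting entirely; only the second largest still needs a sort:
df.idxmax(axis=1)  # column name of the row maximum
df.idxmin(axis=1)  # column name of the row minimum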
Upvotes: 0