Reputation: 379
I have a seemingly simple problem that I can't solve:
I have a dataframe which looks like the following:
          DB          GOOG         ATVI         TSLA         TWTR
1.0   -0.0129755   -0.00463688  -0.00408223   0.0653678    0.0779629
2.0   -0.052772    -0.00712359  -0.0120323    0.0230537    0.0245435
3.0    0.00875274   0.00762426   0.00176186   0.0834672   -0.0017326
4.0   -0.0125196    0.00657628  -0.00235884   0.0502074    0.0157572
5.0   -0.0470443   -0.00382168  -0.0153009   -0.0325997   -0.0235564
6.0    0.0140261   -0.00630647  -0.00265291  -0.037598    -0.0454938
7.0    0.000624415 -0.00429897  -0.00088587  -9.73558e-05 -0.0216945
8.0   -0.0138933   -0.00455289  -0.027357    -0.00682128  -0.0186916
9.0   -0.00311624  -0.000168211 -0.0100577   -0.00894537  -0.00181214
10.0   0.0864933    0.0151531    0.05061      0.0294589    0.0395802
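For reproducibility, this frame can be rebuilt from the values above:
import pandas as pd

data = {'DB':   [-0.0129755, -0.052772, 0.00875274, -0.0125196, -0.0470443,
                 0.0140261, 0.000624415, -0.0138933, -0.00311624, 0.0864933],
        'GOOG': [-0.00463688, -0.00712359, 0.00762426, 0.00657628, -0.00382168,
                 -0.00630647, -0.00429897, -0.00455289, -0.000168211, 0.0151531],
        'ATVI': [-0.00408223, -0.0120323, 0.00176186, -0.00235884, -0.0153009,
                 -0.00265291, -0.00088587, -0.027357, -0.0100577, 0.05061],
        'TSLA': [0.0653678, 0.0230537, 0.0834672, 0.0502074, -0.0325997,
                 -0.037598, -9.73558e-05, -0.00682128, -0.00894537, 0.0294589],
        'TWTR': [0.0779629, 0.0245435, -0.0017326, 0.0157572, -0.0235564,
                 -0.0454938, -0.0216945, -0.0186916, -0.00181214, 0.0395802]}
df = pd.DataFrame(data,
                  index=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
                  columns=['DB','GOOG','ATVI','TSLA','TWTR'])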
For every row I want to find the column names corresponding to the largest, second largest and smallest values.
Example:
For the first row (index 1.0) I want to extract TWTR (largest value), TSLA (second largest value) and DB (smallest value).
...
For the fifth row (index 5.0) I want to extract GOOG (largest value), ATVI (second largest value) and DB (smallest value).
... and so on.
What is the simplest way to do this? Thanks!
Upvotes: 1
Views: 1195
Reputation: 863731
You can use the very fast numpy.argsort
to reorder the column names by the sorted values in each row:
import numpy as np

print (np.argsort(-df.values, axis=1))
[[4 3 2 1 0]
[4 3 1 2 0]
[3 0 1 2 4]
[3 4 1 2 0]
[1 2 4 3 0]
[0 2 1 3 4]
[0 3 2 1 4]
[1 3 0 4 2]
[1 4 0 3 2]
[0 2 4 3 1]]
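Negating the values flips the direction: argsort sorts ascending, so argsort(-a) gives the positions of a from largest to smallest. A quick check of the equivalence (valid for rows without ties):
a = np.array([0.0653678, 0.0779629, -0.0129755])
print (np.argsort(-a))       # [1 0 2] - descending order of a
print (np.argsort(a)[::-1])  # same result when the values are distinct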
print (df.columns[np.argsort(-df.values, axis=1)])
Index([['TWTR', 'TSLA', 'ATVI', 'GOOG', 'DB'],
['TWTR', 'TSLA', 'GOOG', 'ATVI', 'DB'],
['TSLA', 'DB', 'GOOG', 'ATVI', 'TWTR'],
['TSLA', 'TWTR', 'GOOG', 'ATVI', 'DB'],
['GOOG', 'ATVI', 'TWTR', 'TSLA', 'DB'],
['DB', 'ATVI', 'GOOG', 'TSLA', 'TWTR'],
['DB', 'TSLA', 'ATVI', 'GOOG', 'TWTR'],
['GOOG', 'TSLA', 'DB', 'TWTR', 'ATVI'],
['GOOG', 'TWTR', 'DB', 'TSLA', 'ATVI'],
['DB', 'ATVI', 'TWTR', 'TSLA', 'GOOG']],
dtype='object')
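# keep positions 0 (largest), 1 (second largest) and -1 (smallest) of each sorted row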
df = pd.DataFrame(df.columns[np.argsort(-df.values, axis=1)][:,[0,1,-1]], index=df.index)
df.columns = ['largest','second largest','smallest']
print (df)
largest second largest smallest
1.0 TWTR TSLA DB
2.0 TWTR TSLA DB
3.0 TSLA DB TWTR
4.0 TSLA TWTR DB
5.0 GOOG ATVI DB
6.0 DB ATVI TWTR
7.0 DB TSLA TWTR
8.0 GOOG TSLA ATVI
9.0 GOOG TWTR ATVI
10.0 DB ATVI GOOG
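If the frame gets wide, a full per-row sort does more work than needed. Here is a sketch of a numpy.argpartition variation (an extra aside, not among the timed solutions below; it assumes the original df):
vals = df.values
# passing both 0 and 1 as kth guarantees those two positions come out in sorted order,
# so column 0 holds the largest and column 1 the second largest per row
top2 = np.argpartition(-vals, (0, 1), axis=1)[:, :2]
res = pd.DataFrame({'largest': df.columns[top2[:, 0]],
                    'second largest': df.columns[top2[:, 1]],
                    'smallest': df.columns[vals.argmin(axis=1)]},
                   index=df.index,
                   columns=['largest','second largest','smallest'])
print (res)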
Other, slower solutions use apply
with a custom function that sorts each row and picks out the index labels:
def f(x):
    x = x.sort_values()
    return pd.Series([x.index[-1], x.index[-2], x.index[0]],
                     index=['largest','second largest','smallest'])
df = df.apply(f, axis=1)
print (df)
largest second largest smallest
1.0 TWTR TSLA DB
2.0 TWTR TSLA DB
3.0 TSLA DB TWTR
4.0 TSLA TWTR DB
5.0 GOOG ATVI DB
6.0 DB ATVI TWTR
7.0 DB TSLA TWTR
8.0 GOOG TSLA ATVI
9.0 GOOG TWTR ATVI
10.0 DB ATVI GOOG
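# sort each row, keep the full label order, then slice positions -1 (largest), -2 and 0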
df = df.apply(lambda x: x.sort_values().index, axis=1)
df = df.iloc[:, [-1,-2,0]]
df.columns = ['largest','second largest','smallest']
print (df)
largest second largest smallest
1.0 TWTR TSLA DB
2.0 TWTR TSLA DB
3.0 TSLA DB TWTR
4.0 TSLA TWTR DB
5.0 GOOG ATVI DB
6.0 DB ATVI TWTR
7.0 DB TSLA TWTR
8.0 GOOG TSLA ATVI
9.0 GOOG TWTR ATVI
10.0 DB ATVI GOOG
Timings (the vectorised argsort solution avoids a Python-level loop over the rows, hence the large gap):
#[10000 rows x 5 columns]
df = pd.concat([df]*1000).reset_index(drop=True)
In [357]: %timeit pd.DataFrame(df.columns[np.argsort(-df.values, axis=1)][:,[0,1,-1]], index=df.index, columns=['largest','second largest','smallest'])
1000 loops, best of 3: 974 µs per loop
In [358]: %timeit df.apply(f ,axis=1)
1 loop, best of 3: 3.91 s per loop
In [361]: %timeit df.apply(lambda x: x.sort_values().index ,axis=1).iloc[:, [-1,-2,0]]
1 loop, best of 3: 1.88 s per loop
In [362]: %timeit df.apply(lambda x: x.sort_values().index.to_series().iloc[0], axis=1)
1 loop, best of 3: 2.47 s per loop
In [363]: %timeit df.apply(lambda x: x.sort_values(ascending=False).index.to_series().iloc[0], axis=1)
1 loop, best of 3: 2.51 s per loop
In [364]: %timeit df.apply(lambda x: x.sort_values(ascending=False).index.to_series().iloc[1], axis=1)
1 loop, best of 3: 2.52 s per loop
In [365]: %timeit [df.T.sort_values(by=k).T.keys()[-1] for k in df.T.keys()]
1 loop, best of 3: 6.42 s per loop
Upvotes: 2
Reputation: 781
Not sure if it is the simplest solution, but it works:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5,5), columns=['a','b','c','d','e'])
largest = [df.T.sort_values(by=k).T.keys()[-1] for k in df.T.keys()]
largest2 = [df.T.sort_values(by=k).T.keys()[-2] for k in df.T.keys()]
smallest = [df.T.sort_values(by=k).T.keys()[0] for k in df.T.keys()]
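These three lists line up with the rows of df, so they can be glued into one frame (a small sketch on top of the answer):
summary = pd.DataFrame({'largest': largest,
                        'second largest': largest2,
                        'smallest': smallest},
                       index=df.index,
                       columns=['largest','second largest','smallest'])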
Upvotes: 1
Reputation: 6794
You can use apply
along the row axis and then sort_values
to order the data. Then you use integer-based indexing on the sorted series' index to get what you want:
# smallest value
df.apply(lambda x: x.sort_values().index[0], axis=1)
# biggest value
df.apply(lambda x: x.sort_values(ascending=False).index[0], axis=1)
# 2nd biggest value
df.apply(lambda x: x.sort_values(ascending=False).index[1], axis=1)
Applied to the example data, the output looks like:
In [38]: df.apply(lambda x: x.sort_values().index.to_series().iloc[0], axis=1)
Out[38]:
1.0 DB
2.0 DB
3.0 TWTR
4.0 DB
5.0 DB
6.0 TWTR
7.0 TWTR
8.0 ATVI
9.0 ATVI
10.0 GOOG
dtype: object
In [39]: df.apply(lambda x: x.sort_values(ascending=False).index.to_series().iloc[0], axis=1)
Out[39]:
1.0 TWTR
2.0 TWTR
3.0 TSLA
4.0 TSLA
5.0 GOOG
6.0 DB
7.0 DB
8.0 GOOG
9.0 GOOG
10.0 DB
dtype: object
In [40]: df.apply(lambda x: x.sort_values(ascending=False).index.to_series().iloc[1], axis=1)
Out[40]:
1.0 TSLA
2.0 TSLA
3.0 DB
4.0 TWTR
5.0 ATVI
6.0 ATVI
7.0 TSLA
8.0 TSLA
9.0 TWTR
10.0 ATVI
dtype: object
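As an aside (not part of the answer above): if only the largest and smallest labels are needed, the built-in idxmax/idxmin avoid sorting entirely; only the second largest still needs a sort:
df.idxmax(axis=1)  # column name of the row maximum
df.idxmin(axis=1)  # column name of the row minimum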
Upvotes: 0