Reputation: 649
I'm looking for a smooth way to sort a pandas Series by data descending, followed by index ascending. I've been looking around in the docs and on Stack Overflow but couldn't find a straightforward way.
The Series has approximately 5000 entries and is the result of a tf-idf analysis with NLTK.
However, below I provide a very small sample of the data to illustrate the problem.
import pandas as pd
index = ['146tf150p', 'anytime', '645', 'blank', 'anything']
tfidf = [1.000000, 1.000000, 1.000000, 0.932702, 1.000000]
tfidfmax = pd.Series(tfidf, index=index)
For now I'm just converting the Series to a DataFrame, resetting the index, doing the sort and then setting the index, but I feel this is a big detour.
frame = pd.DataFrame(tfidfmax, columns=['data']).reset_index().sort_values(['data', 'index'], ascending=[False, True]).set_index(['index'])
3.02 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I'm looking forward to your suggestions!
Upvotes: 4
Views: 2079
Reputation: 863176
Use the built-in sorted function on the zipped index and values, then create a new Series from the unzipped result:
index = ['146tf150p', 'anytime', '645', 'blank', 'anything']
tfidf = [1.000000, 1.000000, 2.000000, 0.932702, 2.000000]
a = list(zip(*sorted(zip(index, tfidf),key=lambda x:(-x[1],x[0]))))
# if the input is a Series instead of two lists:
# a = list(zip(*sorted(zip(tfidfmax.index, tfidfmax), key=lambda x: (-x[1], x[0]))))
s = pd.Series(a[1], index=a[0])
print (s)
645 2.000000
anything 2.000000
146tf150p 1.000000
anytime 1.000000
blank 0.932702
dtype: float64
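For anyone wondering why that key works: Python compares tuples element by element, so (-value, label) orders by value descending and only falls back to the label (ascending) when two values tie. A minimal sketch on plain pairs, reusing the sample lists from this answer (the names pairs and labels are just for illustration):

```python
index = ['146tf150p', 'anytime', '645', 'blank', 'anything']
tfidf = [1.000000, 1.000000, 2.000000, 0.932702, 2.000000]

# Tuples compare element by element: the negated value decides first,
# and the label only matters when two values are equal.
pairs = sorted(zip(index, tfidf), key=lambda x: (-x[1], x[0]))
labels = [label for label, _ in pairs]
print(labels)
```

Negating the value only works for numeric data; for other types a two-pass stable sort would be needed instead.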
Upvotes: 3
Reputation: 7838
Simple:
In [15]: pd.Series(tfidfmax.sort_values(ascending=False),index=tfidfmax.sort_index().index)
Out[15]:
146tf150p 1.000000
645 1.000000
anything 1.000000
anytime 1.000000
blank 0.932702
dtype: float64
Or a faster way:
In [26]: pd.Series(-np.sort(-tfidfmax),index=np.sort(tfidfmax.index))
Out[26]:
146tf150p 1.000000
645 1.000000
anything 1.000000
anytime 1.000000
blank 0.932702
dtype: float64
In [17]: %timeit tfidfmax[np.lexsort((tfidfmax.index, -tfidfmax.values))]
10000 loops, best of 3: 104 µs per loop
In [18]: %timeit pd.Series(tfidfmax.sort_values(ascending=False),index=tfidfmax.sort_index().index)
1000 loops, best of 3: 406 µs per loop
In [27]: %timeit pd.Series(-np.sort(-tfidfmax),index=np.sort(tfidfmax.index))
10000 loops, best of 3: 91.2 µs per loop
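One caveat worth checking before relying on either one-liner in this answer: as far as I can tell, the first expression reindexes the sorted values back onto the alphabetically sorted labels, and the second sorts values and labels independently of each other. On the sample data both happen to give the requested order, but that is not guaranteed in general. A small sketch with a made-up three-element Series (the data and the names s, first, second, expected are hypothetical, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical data where the largest value does not belong to the first label
s = pd.Series([0.5, 2.0, 1.0], index=['a', 'b', 'c'])

# First one-liner: passing a Series plus an index reindexes by label,
# so the result is simply s ordered by index
first = pd.Series(s.sort_values(ascending=False), index=s.sort_index().index)

# Second one-liner: values and labels are sorted separately,
# so label 'a' ends up paired with 2.0 although s['a'] is 0.5
second = pd.Series(-np.sort(-s.to_numpy()), index=np.sort(s.index.to_numpy()))

# Sorting on both keys together keeps each label with its own value
order = np.lexsort((s.index.to_numpy(), -s.to_numpy()))
expected = s.iloc[order]

print(first.index.tolist(), first.tolist())
print(second.index.tolist(), second.tolist())
print(expected.index.tolist(), expected.tolist())
```

Only the last result keeps every label attached to its original value while ordering by value descending.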
Upvotes: 1
Reputation: 164773
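You can use numpy.lexsort for this:
import numpy as np
# lexsort returns integer positions, so select with .iloc rather than plain []
res = tfidfmax.iloc[np.lexsort((tfidfmax.index, -tfidfmax.values))]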
print(res)
# 146tf150p 1.000000
# 645 1.000000
# anything 1.000000
# anytime 1.000000
# blank 0.932702
# dtype: float64
Note the reversed key order: np.lexsort treats the last key as the primary one, so the code above sorts by descending values first and breaks ties by ascending index.
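To see the effect of the key order, here is a small sketch. It assumes a variation of the sample data with two values raised to 2.0 (my own tweak, so that the two orderings actually differ) and uses .iloc so the integer positions returned by np.lexsort are not mistaken for labels:

```python
import numpy as np
import pandas as pd

index = ['146tf150p', 'anytime', '645', 'blank', 'anything']
tfidf = [1.000000, 1.000000, 2.000000, 0.932702, 2.000000]  # modified sample values
tfidfmax = pd.Series(tfidf, index=index)

labels = tfidfmax.index.to_numpy()
values = tfidfmax.to_numpy()

# The LAST key passed to lexsort is the primary one:
# negated values first (descending), labels as tie-breaker (ascending)
by_value_then_label = tfidfmax.iloc[np.lexsort((labels, -values))]

# Swapping the keys makes the labels primary; since labels are unique,
# this is just an index sort
by_label_then_value = tfidfmax.iloc[np.lexsort((-values, labels))]

print(by_value_then_label.index.tolist())
print(by_label_then_value.index.tolist())
```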
Upvotes: 6