Reputation: 1250
I have a pandas series with strings with separators in them, like say:
['160.20.2257.92', '829.328.17.39']
I want to sort them. If I use Seres.sort_values() like in the below code:
a = pd.Series(['6.0.0.0', '10.0.4.0'])
a.sort_values()
I get the output as :
1 10.0.4.0
0 6.0.0.0
which is quite expected since the sorting function compares 6 with 1 not 6 with 10 and since 1 is smaller it is displayed first in sorted order. What I want is it to be sorted by the first part before the separator('.'), followed by the 2nd part, and so on(i.e compare 10 & 6, followed by 0 & 0, followed by 4 & 0, finally 0 & 0)
What is the best way in Pandas, in terms of speed to achieve this, since I am dealing with a large dataset?
Upvotes: 0
Views: 223
Reputation: 3790
It is an extension of @Ananay Mital's answer - if you wish to save an index.
def sort_data(series):
sorted_data = sorted(
zip(series.index, map(lambda x: tuple(map(int, x.split('.'))), series)),
key=lambda x: x[1])
return pd.Series(
map(lambda x: '.'.join(map(str, x[1])), sorted_data),
index=[i[0] for i in sorted_data])
sorted_data = sort_data(series)
Example:
series
Out:
0 2.49.30.91
1 2.16.99.113
2 62.38.107.41
3 127.21.16.56
4 14.97.112.42
5 49.25.90.11
6 92.87.23.61
7 87.121.78.112
8 17.73.95.37
9 28.117.48.89
dtype: object
Sorted series:
sort_data(series)
Out:
1 2.16.99.113
0 2.49.30.91
4 14.97.112.42
8 17.73.95.37
9 28.117.48.89
5 49.25.90.11
2 62.38.107.41
7 87.121.78.112
6 92.87.23.61
3 127.21.16.56
dtype: object
There's no simple approach to do what you looking for with a standard pandas
methods; also, some my attemtpts to do that with pandas was not so fast as @Ananay Mital's solution.
This approach is fast enough (approx. time to sort 100.000 ip's ~4 ms, 1.000.000 ip's ~84 ms).
If you want it work faster because dataset is huge and O(n)/O(n*log(n))
difference matters, you can implement counting sort or radix sort (only if the data consists of ip's or so - domain should be restricted). Even implemented in python
, such algorithm will be faster than built-in sorted
because of O(n)
complexity.
Upvotes: 1
Reputation: 1475
I believe this is what you are looking for
a = ['160.20.2257.92', '829.328.17.39']
b = sorted(map(lambda x: tuple(map(int, x.split('.'))), a))
final = map(lambda x: '.'.join(map(str, x)), b)
final
['160.20.2257.92', '829.328.17.39']
I hope this covers all corner cases
Upvotes: 2