Sorting Strings with separators in pandas

Question

I have a pandas series with strings with separators in them, like say:

['160.20.2257.92', '829.328.17.39']

I want to sort them. If I use Seres.sort_values() like in the below code:

a = pd.Series(['6.0.0.0', '10.0.4.0'])
a.sort_values()

I get the output as :

1    10.0.4.0
0     6.0.0.0

which is quite expected since the sorting function compares 6 with 1 not 6 with 10 and since 1 is smaller it is displayed first in sorted order. What I want is it to be sorted by the first part before the separator('.'), followed by the 2nd part, and so on(i.e compare 10 & 6, followed by 0 & 0, followed by 4 & 0, finally 0 & 0)

What is the best way in Pandas, in terms of speed to achieve this, since I am dealing with a large dataset?

Mikhail Stepanov · Accepted Answer

It is an extension of @Ananay Mital's answer - if you wish to save an index.

def sort_data(series): 
    sorted_data = sorted(
                      zip(series.index, map(lambda x: tuple(map(int, x.split('.'))), series)), 
                      key=lambda x: x[1])  
    return pd.Series(
               map(lambda x: '.'.join(map(str, x[1])), sorted_data),
               index=[i[0] for i in sorted_data])

sorted_data = sort_data(series)

Example:

series
Out:
0       2.49.30.91
1      2.16.99.113
2     62.38.107.41
3     127.21.16.56
4     14.97.112.42
5      49.25.90.11
6      92.87.23.61
7    87.121.78.112
8      17.73.95.37
9     28.117.48.89
dtype: object

Sorted series:

sort_data(series)
Out:
1      2.16.99.113
0       2.49.30.91
4     14.97.112.42
8      17.73.95.37
9     28.117.48.89
5      49.25.90.11
2     62.38.107.41
7    87.121.78.112
6      92.87.23.61
3     127.21.16.56
dtype: object

There's no simple approach to do what you looking for with a standard pandas methods; also, some my attemtpts to do that with pandas was not so fast as @Ananay Mital's solution.

This approach is fast enough (approx. time to sort 100.000 ip's ~4 ms, 1.000.000 ip's ~84 ms).

If you want it work faster because dataset is huge and O(n)/O(n*log(n)) difference matters, you can implement counting sort or radix sort (only if the data consists of ip's or so - domain should be restricted). Even implemented in python, such algorithm will be faster than built-in sorted because of O(n) complexity.

Sorting Strings with separators in pandas

Answers (2)

Related Questions