Niels Brinch
Niels Brinch

Reputation: 3642

Sort anything lexicographically

I have this code which I think I found somewhere on the internet some years ago and it doesn't quite work.

The purpose is to take any string and from that create a string that is lexicographically sorted by a large number - because then inverse (descending) ordering can be achieved by subtracting the number from another even larger number.

private static BigInteger maxSort = new BigInteger(Encoding.Unicode.GetBytes("5335522543087813528200259404529154678271640415603227881439560533607051111046319775598721171814499900"));

public static string GetSortString(string str, bool descending)
{
    var sortNumber = new BigInteger(Encoding.Unicode.GetBytes(str));

    if (descending)
    {
        sortNumber = maxSort - sortNumber;
    }

    return "$SORT!" + sortNumber.ToString().PadLeft(100, '0') + ":" + str;
}

The reason I need this is because I want to use it to insert as RowKey in Azure Table Storage which is the only way to sort in Table Storage. I need to sort any text, any number and any date, both ascending and descending.

Can anyone see the issue with the code or have any code that serves the same purpose?

The question is tagged with C# but of course this is not a question of syntax so if you have the answer in any other code that would be fine too.

Example

I want to convert any string to a number which is lexicographically sorted correctly - because if it's a number, then I can invert it and sort descending.

So for example, if I can convert:

ABBA    to 1234
Beatles to 3131
ZZ Top  to 9584

Then those numbers would sort them correctly ... and, if I subtract them from a large number, I would be able to invert the sort order:

10000 - 1234 = 8766
10000 - 3131 = 6869
10000 - 9584 = 0416

Of course, to support longer text input, I need to subtract them from a very large number, which is why I use the very large BigInteger.

Current output from this code

ABBA:    $SORT!0000000000000000000000018296156958359617:ABBA
Beatles: $SORT!0000000009111360792640460912278748069954:Beatles
ZZ TOP:  $SORT!0000000000000096715522885596192519618650:ZZ TOP

As you can see, the longest text gets the highest number. I have also tried to add padding immediately on the input str, but that didnt help either.

Answer

The accepted answer worked. For descending sort order, the "BigInteger" trick from above could be used.

There is some limitation as to how long the sortable string can be.

Here is the final code:

private static BigInteger maxSort = new BigInteger(Encoding.Unicode.GetBytes("5335522543087813528200259404529154678271640415603227881439560533607051111046319775598721171814499900"));

public static string GetSortString(string str, bool descending)
{
    BigInteger result = 0;
    int maxLength = 42;

    foreach (var c in str.ToCharArray())
    {
        result = result * 256 + c;
    }

    for (int i = str.Length; i < maxLength; i++)
    {
        result = result * 256;
    }

    if (descending)
    {
        result = maxSort - result;
    }

    return "$SORT!" + result;
}

Upvotes: 0

Views: 405

Answers (1)

Surt
Surt

Reputation: 16109

If you were looking for a way to give a a value to any string so that you could sort them accordingly to the number and get the same result as above you can't. The reason is that strings don't have any length limit. Because you can always add a char to a string and thereby get a larger number even through it should have a lower lexicographical value.

If they have a length limit you can do something like this

pseudo code

bignum res = 0;
maxLength = 42;
for (char c : string)
  res = res * 256 + c
for (int i = string.length; i < maxLength; i++)
  res = res *256

If you want to optimize a bit, the last loop could be a lookup table. If your only using a-z, the times 256 could reduced to 26 or 32.

Upvotes: 1

Related Questions