MRocklin
MRocklin

Reputation: 57281

Check if a numpy array is lexicographically sorted

How do I check if a sequence of numpy arrays are lexicographically sorted?

>>> x = np.asarray([0, 0, 1, 1])
>>> y = np.asarray([0, 1, 0, 2])
>>> is_lex_sorted([x, y])
True

>>> x = np.asarray([100, 0, 1, 1])
>>> y = np.asarray([0, 1, 0, 2])
>>> is_lex_sorted([x, y])
False

Ideally this check would run in linear time and run at NumPy speeds. Ideally also it would be able to return quickly when it first encounters a false result.

Upvotes: 1

Views: 374

Answers (2)

Alex Riley
Alex Riley

Reputation: 176948

An implementation of is_lexsorted using pure-NumPy functions would almost certainly need to make several passes over one or more of the arrays (since NumPy functions are designed to operate on entire arrays in one go).

This means that writing the function in numba or Cython may be a better option if speed is a primary concern.

There's one such Cython function already implemented in pandas as is_lexsorted. This has the short-circuiting behaviour you specify and its running time is linear with the number of, and length of, arrays.

>>> from pandas.algos import is_lexsorted 
>>> x = np.asarray([0, 0, 1, 1])
>>> y = np.asarray([0, 1, 0, 2])
>>> is_lexsorted([x, y])
True

(For pandas 0.20.0+ use from pandas.lib import is_lexsorted.)

In pandas it's used to check if levels of MultiIndexes are sorted or not. However, beware, it's not in the public API and so could change in future.

Upvotes: 2

Warren Weckesser
Warren Weckesser

Reputation: 114921

If n (the maxmimum integer in the arrays) and the number of columns are not too large, you can convert each row to an integer in a way that preserves the lexicographic ordering:

def is_lex_sorted(a):
    n = a.max() + 1
    v = (a*(n**np.arange(a.shape[1]-1, -1, -1))).sum(axis=1)
    return np.all(v[:-1] <= v[1:])

For example,

In [230]: x
Out[230]: 
array([[0, 0, 1],
       [0, 1, 2],
       [1, 0, 4],
       [1, 2, 1]])

In [231]: is_lex_sorted(x)
Out[231]: True

In [232]: y
Out[232]: 
array([[  0,   0,   1,  13],
       [  0,   1,   2,  30],
       [100,   0,   4,  10],
       [  1,  20,   1,  80]])

In [233]: is_lex_sorted(y)
Out[233]: False

Two problems:

  • It will only work if the n and the number of columns are "small". If they are not, the integers used in the calculation will overflow and the result will be wrong.
  • It does not stop early.

Upvotes: 2

Related Questions