William Scantlin

Reputation: 123

Python: Sort list of lists numerically

I have a list of x, y coordinates that I need to sort by x coordinate, then by y coordinate when x is the same, while also eliminating duplicate coordinates. For example, if the list is:

[[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [350.0, 313.3], [300.0, 400.0], 
 [349.9, 486.6], [450.0, 313.3]]

I would need to rearrange it to:

[[300.0, 400.0], [349.9, 486.6], [350.0, 313.3], [450.0, 313.3], [450.0, 486.6],
 [500.0, 400.0]]

(with one duplicate of [450.0, 313.3] removed)

Upvotes: 4

Views: 977

Answers (5)

user3483203

Reputation: 51165

We can do this quite fast using np.lexsort and some masking:

import numpy as np

def lex(arr):
    # sort the rows so that duplicate rows end up adjacent
    tmp = arr[np.lexsort(arr.T), :]
    # keep the first row, then only rows that differ from the previous row
    tmp = tmp[np.append([True], np.any(np.diff(tmp, axis=0), 1))]
    # final sort: x (column 0) as primary key, y (column 1) as secondary key
    return tmp[np.lexsort((tmp[:, 1], tmp[:, 0]), axis=0)]

L = np.array(L)  # L is the list of coordinates from the question
lex(L)

# Output:
[[300.  400. ]
 [349.9 486.6]
 [350.  313.3]
 [450.  313.3]
 [450.  486.6]
 [500.  400. ]]
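
Step by step, this is what lex is doing on the question's data (the intermediate names below are just for illustration):

import numpy as np

arr = np.array([[450.0, 486.6], [500.0, 400.0], [450.0, 313.3],
                [350.0, 313.3], [300.0, 400.0], [349.9, 486.6],
                [450.0, 313.3]])

order = np.lexsort(arr.T)                  # sort rows so that equal rows end up adjacent
tmp = arr[order]
changed = np.any(np.diff(tmp, axis=0), 1)  # True where a row differs from the previous row
keep = np.append([True], changed)          # always keep the first row
deduped = tmp[keep]                        # adjacent duplicates dropped
result = deduped[np.lexsort((deduped[:, 1], deduped[:, 0]))]  # x primary, y secondary
print(result)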

Performance

Functions

import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from timeit import timeit
from toolz import unique

def chrisz(arr):
    tmp = arr[np.lexsort(arr.T), :]
    tmp = tmp[np.append([True], np.any(np.diff(tmp, axis=0), 1))]
    return tmp[np.lexsort((tmp[:, 1], tmp[:, 0]), axis=0)]

def pp(data):
    return [k for k, g in itertools.groupby(sorted(data))]

def gazer(data):
    return np.unique(data, axis=0)

def wim(L):
    return sorted({tuple(x): x for x in L}.values())

def jpp(L):
    return sorted(unique(L, key=tuple))

Setup

res = pd.DataFrame(
       index=['chrisz', 'pp', 'gazer', 'wim', 'jpp'],
       columns=[10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000],
       dtype=float
)

for f in res.index: 
    for c in res.columns:
        npL = np.random.randint(1,1000,(c,2)) + np.random.choice(np.random.random(1000), (c, 2))
        L = npL.tolist()
        stmt = '{}(npL)'.format(f) if f in {'chrisz', 'gazer'} else '{}(L)'.format(f)
        setp = 'from __main__ import L, npL, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True) 
ax.set_xlabel("N"); 
ax.set_ylabel("time (relative)");

plt.show()

(Figure: relative timings vs N for each solution, log-log scale)

Validation

npL = np.random.randint(1,1000,(100000,2)) + np.random.choice(np.random.random(1000), (100000, 2))    
L = npL.tolist()    
chrisz(npL).tolist() == pp(L) == gazer(npL).tolist() == wim(L) == jpp(L)
True

Upvotes: 2

AGN Gazer

Reputation: 8378

What you want seems to be easily done with numpy's unique function:

import numpy as np
u = np.unique(data, axis=0) # or np.unique(data, axis=0).tolist()

If you are worried that the result might not be sorted by column (x first, then y when x is equal), apply np.lexsort() to the output of the above:

u = u[np.lexsort((u[:,1], u[:,0]))]
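
Put together on the question's data, as a minimal self-contained sketch:

import numpy as np

data = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3],
        [350.0, 313.3], [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]

u = np.unique(data, axis=0)            # unique rows, already in lexicographic row order
u = u[np.lexsort((u[:, 1], u[:, 0]))]  # explicit sort: x primary, y secondary
print(u.tolist())
# [[300.0, 400.0], [349.9, 486.6], [350.0, 313.3],
#  [450.0, 313.3], [450.0, 486.6], [500.0, 400.0]]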

Timings (non-random sample):

In [1]: import numpy as np

In [2]: from toolz import unique

In [3]: data = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3],
   ...:  [350.0, 313.3], [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]
   ...:  

In [4]: L = 100000 * data

In [5]: npL = np.array(L)

In [6]: %timeit sorted(unique(L, key=tuple))
125 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit sorted({tuple(x): x for x in L}.values())
139 ms ± 3.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit np.unique(L, axis=0)
732 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %timeit np.unique(npL, axis=0)
584 ms ± 8.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# @user3483203's solution:

In [57]: %timeit lex(np.asarray(L))
227 ms ± 8.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [58]: %timeit lex(npL)
76.2 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timings (more random sample):

When sample data are more random, the results are different:

In [29]: npL = np.random.randint(1,1000,(100000,2)) + np.random.choice(np.random.random(1000), (100000, 2))

In [30]: L = npL.tolist()

In [31]: %timeit sorted(unique(L, key=tuple))
143 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [32]: %timeit sorted({tuple(x): x for x in L}.values())
134 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [33]: %timeit np.unique(L, axis=0)
78.5 ms ± 1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [34]: %timeit np.unique(npL, axis=0)
54 ms ± 398 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# @Paul Panzer's solution:

In [36]: import itertools

In [37]: %timeit [k for k, g in itertools.groupby(sorted(L))]
123 ms ± 3.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# @user3483203's solution:

In [54]: %timeit lex(np.asarray(L))
60.1 ms ± 744 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [55]: %timeit lex(npL)
38.8 ms ± 728 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Upvotes: 2

jpp

Reputation: 164623

Here's one way using sorted and toolz.unique:

from toolz import unique

res = sorted(unique(L, key=tuple))

print(res)

[[300.0, 400.0], [349.9, 486.6], [350.0, 313.3],
 [450.0, 313.3], [450.0, 486.6], [500.0, 400.0]]

Note toolz.unique is equivalent to the unique_everseen recipe in the itertools documentation (it is not in the standard library itself, but is also available via more_itertools). The tuple conversion is necessary because the algorithm checks uniqueness by hashing with a set, and lists are not hashable.

Performance of the set-based approach appears slightly better than the dict-based one here, but as always you should test with your own data.

L = L*100000

%timeit sorted(unique(L, key=tuple))               # 223 ms
%timeit sorted({tuple(x): x for x in L}.values())  # 243 ms

I suspect this is because unique is lazy: sorted consumes the deduplicated stream directly, so there is no intermediate list of the full (pre-dedupe) input sitting in memory.
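
A quick way to see the laziness (a small sketch, just for illustration):

from toolz import unique

L = [[450.0, 486.6], [500.0, 400.0], [450.0, 486.6]]
it = unique(L, key=tuple)
print(it)        # a generator; nothing has been consumed yet
print(next(it))  # items are produced one at a time: [450.0, 486.6]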

Upvotes: 0

Paul Panzer

Reputation: 53029

As we are sorting anyway, we can dedupe with groupby:

>>> import itertools
>>> [k for k, g in itertools.groupby(sorted(data))]                                                                 
[[300.0, 400.0], [349.9, 486.6], [350.0, 313.3], [450.0, 313.3], [450.0, 486.6], [500.0, 400.0]]                    
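
Why sorting first matters: groupby only merges consecutive equal elements, so the sort is what turns it into a full dedupe. A small illustration:

>>> from itertools import groupby
>>> data = [[450.0, 313.3], [300.0, 400.0], [450.0, 313.3]]
>>> [k for k, g in groupby(data)]          # unsorted: the repeated coordinate survives
[[450.0, 313.3], [300.0, 400.0], [450.0, 313.3]]
>>> [k for k, g in groupby(sorted(data))]  # sorted: equal rows are adjacent and collapse
[[300.0, 400.0], [450.0, 313.3]]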

A few timings:

>>> import numpy as np  # just to create a large example
>>> from itertools import groupby
>>> from timeit import timeit
>>> from toolz import unique  # needed for @jpp's solution below
>>> a = np.random.randint(0, 215, (10000, 2)).tolist()
>>> len([k for k, g in groupby(sorted(a))])
8977 # ~ 10% duplicates
>>> 
>>> timeit("[k for k, g in groupby(sorted(a))]", globals=globals(), number=1000)
6.1627248489967315
>>> timeit("sorted({tuple(x): x for x in a}.values())", globals=globals(), number=1000)
6.654527607999626
>>> timeit("sorted(unique(a, key=tuple))", globals=globals(), number=1000)
7.198703720991034
>>> timeit("np.unique(a, axis=0).tolist()", globals=globals(), number=1000)
8.848866895001265

Upvotes: 2

wim

Reputation: 362547

That is the normal sort order for a list of lists, anyway. De-dupe it with a dict.

>>> L = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [350.0, 313.3], [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]
>>> sorted({tuple(x): x for x in L}.values())
[[300.0, 400.0],
 [349.9, 486.6],
 [350.0, 313.3],
 [450.0, 313.3],
 [450.0, 486.6],
 [500.0, 400.0]]
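
The tuple keys are there because lists are not hashable, so they can't be dict keys (or set members) directly. If you don't mind getting tuples back instead of lists, a set of tuples works too (just a variation, not required):

>>> sorted(set(map(tuple, L)))
[(300.0, 400.0), (349.9, 486.6), (350.0, 313.3), (450.0, 313.3), (450.0, 486.6), (500.0, 400.0)]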

Upvotes: 6
