William Scantlin

Reputation: 123

Python: Sort list of lists numerically

I have a list of x, y coordinates that I need to sort by x coordinate, then by y coordinate when x is the same, while also eliminating duplicate coordinates. For example, if the list is:

[[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [350.0, 313.3], [300.0, 400.0], 
 [349.9, 486.6], [450.0, 313.3]]

I would need to rearrange it to:

[[300.0, 400.0], [349.9, 486.6], [350.0, 313.3], [450.0, 313.3], [450.0, 486.6],
 [500.0, 400.0]]

(with one duplicate of [450.0, 313.3] removed)

Upvotes: 4

Views: 977

Answers (5)

user3483203

Reputation: 51165

We can do this quite fast using np.lexsort and some masking:

import numpy as np

def lex(arr):
    # sort the rows so that duplicate rows end up adjacent
    tmp = arr[np.lexsort(arr.T), :]
    # keep the first row, then only rows that differ from the previous row
    tmp = tmp[np.append([True], np.any(np.diff(tmp, axis=0), 1))]
    # final sort: x (column 0) as primary key, y (column 1) as secondary key
    return tmp[np.lexsort((tmp[:, 1], tmp[:, 0]), axis=0)]

L = np.array(L)  # L is the list of coordinates from the question
lex(L)

# Output:
[[300.  400. ]
 [349.9 486.6]
 [350.  313.3]
 [450.  313.3]
 [450.  486.6]
 [500.  400. ]]
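
Step by step, this is what lex is doing on the question's data (the intermediate names below are just for illustration):

import numpy as np

arr = np.array([[450.0, 486.6], [500.0, 400.0], [450.0, 313.3],
                [350.0, 313.3], [300.0, 400.0], [349.9, 486.6],
                [450.0, 313.3]])

order = np.lexsort(arr.T)                  # sort rows so that equal rows end up adjacent
tmp = arr[order]
changed = np.any(np.diff(tmp, axis=0), 1)  # True where a row differs from the previous row
keep = np.append([True], changed)          # always keep the first row
deduped = tmp[keep]                        # adjacent duplicates dropped
result = deduped[np.lexsort((deduped[:, 1], deduped[:, 0]))]  # x primary, y secondary
print(result)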

Performance

Functions

import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from timeit import timeit
from toolz import unique

def chrisz(arr):
    tmp = arr[np.lexsort(arr.T), :]
    tmp = tmp[np.append([True], np.any(np.diff(tmp, axis=0), 1))]
    return tmp[np.lexsort((tmp[:, 1], tmp[:, 0]), axis=0)]

def pp(data):
    return [k for k, g in itertools.groupby(sorted(data))]

def gazer(data):
    return np.unique(data, axis=0)

def wim(L):
    return sorted({tuple(x): x for x in L}.values())

def jpp(L):
    return sorted(unique(L, key=tuple))

Setup

res = pd.DataFrame(
       index=['chrisz', 'pp', 'gazer', 'wim', 'jpp'],
       columns=[10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000],
       dtype=float
)

for f in res.index: 
    for c in res.columns:
        npL = np.random.randint(1,1000,(c,2)) + np.random.choice(np.random.random(1000), (c, 2))
        L = npL.tolist()
        stmt = '{}(npL)'.format(f) if f in {'chrisz', 'gazer'} else '{}(L)'.format(f)
        setp = 'from __main__ import L, npL, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True) 
ax.set_xlabel("N"); 
ax.set_ylabel("time (relative)");

plt.show()

(Figure: relative timings vs N for each solution, log-log scale)

Validation

npL = np.random.randint(1,1000,(100000,2)) + np.random.choice(np.random.random(1000), (100000, 2))    
L = npL.tolist()    
chrisz(npL).tolist() == pp(L) == gazer(npL).tolist() == wim(L) == jpp(L)
True

Upvotes: 2

AGN Gazer

Reputation: 8378

What you want seems to be easily done with numpy's unique function:

import numpy as np
u = np.unique(data, axis=0) # or np.unique(data, axis=0).tolist()

If you are worried that the result might not be sorted by column (x first, then y when x is equal), apply np.lexsort() to the output of the above:

u = u[np.lexsort((u[:,1], u[:,0]))]
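
Put together on the question's data, as a minimal self-contained sketch:

import numpy as np

data = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3],
        [350.0, 313.3], [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]

u = np.unique(data, axis=0)            # unique rows, already in lexicographic row order
u = u[np.lexsort((u[:, 1], u[:, 0]))]  # explicit sort: x primary, y secondary
print(u.tolist())
# [[300.0, 400.0], [349.9, 486.6], [350.0, 313.3],
#  [450.0, 313.3], [450.0, 486.6], [500.0, 400.0]]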

Timings (non-random sample):

In [1]: import numpy as np

In [2]: from toolz import unique

In [3]: data = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3],
   ...:  [350.0, 313.3], [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]
   ...:  

In [4]: L = 100000 * data

In [5]: npL = np.array(L)

In [6]: %timeit sorted(unique(L, key=tuple))
125 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit sorted({tuple(x): x for x in L}.values())
139 ms ± 3.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit np.unique(L, axis=0)
732 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %timeit np.unique(npL, axis=0)
584 ms ± 8.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# @user3483203's solution:

In [57]: %timeit lex(np.asarray(L))
227 ms ± 8.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [58]: %timeit lex(npL)
76.2 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timings (more random sample):

When sample data are more random, the results are different:

In [29]: npL = np.random.randint(1,1000,(100000,2)) + np.random.choice(np.random.random(1000), (100000, 2))

In [30]: L = npL.tolist()

In [31]: %timeit sorted(unique(L, key=tuple))
143 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [32]: %timeit sorted({tuple(x): x for x in L}.values())
134 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [33]: %timeit np.unique(L, axis=0)
78.5 ms ± 1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [34]: %timeit np.unique(npL, axis=0)
54 ms ± 398 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# @Paul Panzer's solution:

In [36]: import itertools

In [37]: %timeit [k for k, g in itertools.groupby(sorted(L))]
123 ms ± 3.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# @user3483203's solution:

In [54]: %timeit lex(np.asarray(L))
60.1 ms ± 744 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [55]: %timeit lex(npL)
38.8 ms ± 728 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Upvotes: 2

jpp

Reputation: 164623

Here's one way using sorted and toolz.unique:

from toolz import unique

res = sorted(unique(L, key=tuple))

print(res)

[[300.0, 400.0], [349.9, 486.6], [350.0, 313.3],
 [450.0, 313.3], [450.0, 486.6], [500.0, 400.0]]

Note toolz.unique is equivalent to the unique_everseen recipe in the itertools documentation (it is not in the standard library itself, but is also available via more_itertools). The tuple conversion is necessary because the algorithm checks uniqueness by hashing with a set, and lists are not hashable.

Performance of the set-based approach appears slightly better than the dict-based one here, but as always you should test with your own data.

L = L*100000

%timeit sorted(unique(L, key=tuple))               # 223 ms
%timeit sorted({tuple(x): x for x in L}.values())  # 243 ms

I suspect this is because unique is lazy: sorted consumes the deduplicated stream directly, so there is no intermediate list of the full (pre-dedupe) input sitting in memory.
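
A quick way to see the laziness (a small sketch, just for illustration):

from toolz import unique

L = [[450.0, 486.6], [500.0, 400.0], [450.0, 486.6]]
it = unique(L, key=tuple)
print(it)        # a generator; nothing has been consumed yet
print(next(it))  # items are produced one at a time: [450.0, 486.6]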

Upvotes: 0

Paul Panzer

Reputation: 53029

As we are sorting anyway, we can dedupe with groupby:

>>> import itertools
>>> [k for k, g in itertools.groupby(sorted(data))]                                                                 
[[300.0, 400.0], [349.9, 486.6], [350.0, 313.3], [450.0, 313.3], [450.0, 486.6], [500.0, 400.0]]                    
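
Why sorting first matters: groupby only merges consecutive equal elements, so the sort is what turns it into a full dedupe. A small illustration:

>>> from itertools import groupby
>>> data = [[450.0, 313.3], [300.0, 400.0], [450.0, 313.3]]
>>> [k for k, g in groupby(data)]          # unsorted: the repeated coordinate survives
[[450.0, 313.3], [300.0, 400.0], [450.0, 313.3]]
>>> [k for k, g in groupby(sorted(data))]  # sorted: equal rows are adjacent and collapse
[[300.0, 400.0], [450.0, 313.3]]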

A few timings:

>>> import numpy as np  # just to create a large example
>>> from itertools import groupby
>>> from timeit import timeit
>>> from toolz import unique  # needed for @jpp's solution below
>>> a = np.random.randint(0, 215, (10000, 2)).tolist()
>>> len([k for k, g in groupby(sorted(a))])
8977 # ~ 10% duplicates
>>> 
>>> timeit("[k for k, g in groupby(sorted(a))]", globals=globals(), number=1000)
6.1627248489967315
>>> timeit("sorted({tuple(x): x for x in a}.values())", globals=globals(), number=1000)
6.654527607999626
>>> timeit("sorted(unique(a, key=tuple))", globals=globals(), number=1000)
7.198703720991034
>>> timeit("np.unique(a, axis=0).tolist()", globals=globals(), number=1000)
8.848866895001265

Upvotes: 2

wim

Reputation: 362547

That is the normal sort order for a list of lists, anyway. De-dupe it with a dict.

>>> L = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [350.0, 313.3], [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]
>>> sorted({tuple(x): x for x in L}.values())
[[300.0, 400.0],
 [349.9, 486.6],
 [350.0, 313.3],
 [450.0, 313.3],
 [450.0, 486.6],
 [500.0, 400.0]]
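
The tuple keys are there because lists are not hashable, so they can't be dict keys (or set members) directly. If you don't mind getting tuples back instead of lists, a set of tuples works too (just a variation, not required):

>>> sorted(set(map(tuple, L)))
[(300.0, 400.0), (349.9, 486.6), (350.0, 313.3), (450.0, 313.3), (450.0, 486.6), (500.0, 400.0)]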

Upvotes: 6
