Reputation: 123
I have a list of x
,y
coordinates that I need to sort based on the x
coordinate, then y
coordinate when x
is the same and eliminate duplicates of the same coordinates. For example, if the list is:
[[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [350.0, 313.3], [300.0, 400.0],
[349.9, 486.6], [450.0, 313.3]]
I would need to rearrange it to:
[[300.0, 400.0], [349.9, 486.6], [350.0, 313.3], [450.0, 313.3], [450.0, 486.6],
[500.0, 400.0]]
(with one duplicate of [450.0, 313.3]
removed)
Upvotes: 4
Views: 977
Reputation: 51165
We can do this quite fast using np.lexsort
and some masking
def lex(arr):
tmp = arr[np.lexsort(arr.T),:]
tmp = tmp[np.append([True],np.any(np.diff(tmp,axis=0),1))]
return tmp[np.lexsort((tmp[:, 1], tmp[:, 0]), axis=0)]
L = np.array(L)
lex(L)
# Output:
[[300. 400. ]
[349.9 486.6]
[350. 313.3]
[450. 313.3]
[450. 486.6]
[500. 400. ]]
Functions
def chrisz(arr):
tmp = arr[np.lexsort(arr.T),:]
tmp = tmp[np.append([True],np.any(np.diff(tmp,axis=0),1))]
return tmp[np.lexsort((tmp[:, 1], tmp[:, 0]), axis=0)]
def pp(data):
return [k for k, g in itertools.groupby(sorted(data))]
def gazer(data):
return np.unique(data, axis=0)
def wim(L):
return sorted({tuple(x): x for x in L}.values())
def jpp(L):
return sorted(unique(L, key=tuple))
Setup
res = pd.DataFrame(
index=['chrisz', 'pp', 'gazer', 'wim', 'jpp'],
columns=[10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000],
dtype=float
)
for f in res.index:
for c in res.columns:
npL = np.random.randint(1,1000,(c,2)) + np.random.choice(np.random.random(1000), (c, 2))
L = npL.tolist()
stmt = '{}(npL)'.format(f) if f in {'chrisz', 'gazer'} else '{}(L)'.format(f)
setp = 'from __main__ import L, npL, {}'.format(f)
res.at[f, c] = timeit(stmt, setp, number=50)
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N");
ax.set_ylabel("time (relative)");
plt.show()
Validation
npL = np.random.randint(1,1000,(100000,2)) + np.random.choice(np.random.random(1000), (100000, 2))
L = npL.tolist()
chrisz(npL).tolist() == pp(L) == gazer(npL).tolist() == wim(L) == jpp(L)
True
Upvotes: 2
Reputation: 8378
What you want seems to be easily done with numpy
's unique
function:
import numpy as np
u = np.unique(data, axis=0) # or np.unique(data, axis=0).tolist()
If you are really worried that the array is not sorted by columns, then run np.lexsort()
in addition to the above:
u = u[np.lexsort((u[:,1], u[:,0]))]
In [1]: import numpy as np
In [2]: from toolz import unique
In [3]: data = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3],
...: [350.0, 313.3], [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]
...:
In [4]: L = 100000 * data
In [5]: npL = np.array(L)
In [6]: %timeit sorted(unique(L, key=tuple))
125 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [7]: %timeit sorted({tuple(x): x for x in L}.values())
139 ms ± 3.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: %timeit np.unique(L, axis=0)
732 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [9]: %timeit np.unique(npL, axis=0)
584 ms ± 8.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# @user3483203 solution:
In [57]: %timeit lex(np.asarray(L))
227 ms ± 8.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [58]: %timeit lex(npL)
76.2 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
When sample data are more random, the results are different:
In [29]: npL = np.random.randint(1,1000,(100000,2)) + np.random.choice(np.random.random(1000), (100000, 2))
In [30]: L = npL.tolist()
In [31]: %timeit sorted(unique(L, key=tuple))
143 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [32]: %timeit sorted({tuple(x): x for x in L}.values())
134 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [33]: %timeit np.unique(L, axis=0)
78.5 ms ± 1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [34]: %timeit np.unique(npL, axis=0)
54 ms ± 398 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# @Paul Panzer's solution:
In [36]: import itertools
In [37]: %timeit [k for k, g in itertools.groupby(sorted(L))]
123 ms ± 3.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# @user3483203 solution:
In [54]: %timeit lex(np.asarray(L))
60.1 ms ± 744 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [55]: %timeit lex(npL)
38.8 ms ± 728 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 2
Reputation: 164623
Here's one way using sorted
and toolz.unique
:
from toolz import unique
res = sorted(unique(L, key=tuple))
print(res)
[[300.0, 400.0], [349.9, 486.6], [350.0, 313.3],
[450.0, 313.3], [450.0, 486.6], [500.0, 400.0]]
Note toolz.unique
is also available via the standard library as the itertools
unique_everseen
recipe. Tuple conversion is necessary as the algorithm uses hashing via set
to check uniqueness.
Performance using set
appears slightly better than dict
here, but as always you should test with your data.
L = L*100000
%timeit sorted(unique(L, key=tuple)) # 223 ms
%timeit sorted({tuple(x): x for x in L}.values()) # 243 ms
I suspect this is because unique
is lazy, and so you have less memory overhead since sorted
isn't making a copy of the input data.
Upvotes: 0
Reputation: 53029
As we are sorting anyway we can dedupe with groupby
:
>>> import itertools
>>> [k for k, g in itertools.groupby(sorted(data))]
[[300.0, 400.0], [349.9, 486.6], [350.0, 313.3], [450.0, 313.3], [450.0, 486.6], [500.0, 400.0]]
A few timings:
>>> import numpy as np # just to create a large example
>>> a = np.random.randint(0, 215, (10000, 2)).tolist()
>>> len([k for k, g in groupby(sorted(a))])
8977 # ~ 10% duplicates
>>>
>>> timeit("[k for k, g in groupby(sorted(a))]", globals=globals(), number=1000)
6.1627248489967315
>>> timeit("sorted({tuple(x): x for x in a}.values())", globals=globals(), number=1000)
6.654527607999626
>>> timeit("sorted(unique(a, key=tuple))", globals=globals(), number=1000)
7.198703720991034
>>> timeit("np.unique(a, axis=0).tolist()", globals=globals(), number=1000)
8.848866895001265
Upvotes: 2
Reputation: 362547
That is the normal sort order for a list of lists, anyway. De-dupe it with a dict.
>>> L = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [350.0, 313.3], [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]
>>> sorted({tuple(x): x for x in L}.values())
[[300.0, 400.0],
[349.9, 486.6],
[350.0, 313.3],
[450.0, 313.3],
[450.0, 486.6],
[500.0, 400.0]]
Upvotes: 6