Reputation: 10789
I have a list of points that looks like this:
points = [(54592748,54593510),(54592745,54593512), ...]
Many of these points are similar in the sense that points[n][0] is almost equal to points[m][0] AND points[n][1] is almost equal to points[m][1]. Where 'almost equal' is a whatever integer I decide. I would like to filter out all the similar points from the list, keeping just one of it.
Here is my code.
points = [(54592748,54593510),(54592745,54593512),(117628626,117630648),(1354358,1619520),(54592746,54593509)]
md = 10 # max distance allowed between two points
to_compare = points[:] # make a list of item to compare
to_remove = set() # keep track of items to be removed
for point in points:
to_compare.remove(point) # do not compare with itself
for other_point in to_compare:
if abs(point[0]-other_point[0]) <= md and abs(point[1]-other_point[1]) <= md:
to_remove.add(other_point)
for point in to_remove:
points.remove(point)
It works...
>>>points
[(54592748, 54593510), (117628626, 117630648), (1354358, 1619520)]
but I am looking for a faster solution since my list is millions items long.
PyPy helped a lot, it speeded up 6 the whole process 6 times, but probably there is a more efficient way to do this in the first place, or not?
Any help is very welcome.
=======
I have tested some of the answers with the points object you can pickle.load() from here https://mega.nz/#!TVci1KDS!tE5fTnjpPwbvpFTmW1TLsVXDvYHbRF8F7g10KGdOPCs
My code takes 1104 seconds and reduces the list to 96428 points (from 99920). David's code do the job in 14 seconds! But misses something, 96431 points left. Martin's code takes 0.06 seconds!! But also misses something, 96462 points left.
Any clue about why the results are not the same?
Upvotes: 2
Views: 624
Reputation: 46759
Depending on how accurate you need this to be, the following approach should work well:
points = [(54592748, 54593510), (54592745, 54593512), (117628626, 117630648), (1354358, 1619520), (54592746, 54593509)]
d = 20
hpoints = {((x - (x % d)), (y - (y % d))) : (x,y) for x, y in points}
for x in hpoints.itervalues():
print x
This converts each point into a dictionary key with each x
and y
coordinate rounded by its modulus. The result is a dictionary holding the coordinate of the last point in a given area. For the data you have given, this would display the following:
(117628626, 117630648)
(54592746, 54593509)
(1354358, 1619520)
Upvotes: 2
Reputation: 2579
This function for getting unique items from a list (it isn't mine, I found it a while back) only loops over the list once (plus dictionary lookups).
def unique(seq, idfun=None):
# order preserving
if idfun is None:
def idfun(x): return x
seen = {}
result = []
for item in seq:
marker = idfun(item)
# in old Python versions:
# if seen.has_key(marker)
# but in new ones:
if marker in seen: continue
seen[marker] = 1
result.append(item)
return result
The id function will require some cleverness. point[0] is divided by error and floored to an integer. So all point[0]'s such that x*error <= point[0] < (x+1)*error are the same and similarly for point[1].
def id(point):
error = 4
x = point[0]//error
y = point[1]//error
idValue = str(x)+"//"+str(y)
return idValue
So these functions will reduce points between consecutive multiples of error to the same point. The good news is that it only touches the original list once plus the dictionary lookups. The bad news is that this id function won't catch for example 15 and 17 should be the same because 15 reduces to 3 and 17 reduces to 4. It is possible that will some cleverness, this issue could be resolved.
[NOTE: I originally used exponents of primes for the idValue, but the exponents would be way to large. If you could make the idValue an int, that would increase lookup speed ]
Upvotes: 0
Reputation: 56
Sorting the list first avoids the inner for loop and thus the n^2 time. I'm not sure if it's practically any quicker though since I don't have your full data. Try this (it outputs the same as far as i can see from your example points, just ordered).
points = [(54592748,54593510),(54592745,54593512),(117628626,117630648),(1354358,1619520),(54592746,54593509)]
md = 10 # max distance allowed between two points
points.sort()
to_remove = set() # keep track of items to be removed
for i, point in enumerate(points):
if i == len(points) - 1:
break
other_point = points[i+1]
if abs(point[0]-other_point[0]) <= md and abs(point[1]-other_point[1]) <= md:
to_remove.add(point)
for point in to_remove:
points.remove(point)
print(points)
Upvotes: 2