alec_djinn

Reputation: 10789

Best way to remove similar points in a list of points

I have a list of points that looks like this:

points = [(54592748,54593510),(54592745,54593512), ...]

Many of these points are similar in the sense that points[n][0] is almost equal to points[m][0] AND points[n][1] is almost equal to points[m][1], where 'almost equal' means within whatever integer tolerance I decide. I would like to filter out all the similar points from the list, keeping just one of each group.

Here is my code.

points = [(54592748,54593510),(54592745,54593512),(117628626,117630648),(1354358,1619520),(54592746,54593509)]
md = 10 # max distance allowed between two points
to_compare = points[:] # make a list of items to compare
to_remove = set() # keep track of items to be removed

for point in points:
    to_compare.remove(point) # do not compare with itself
    for other_point in to_compare:
        if abs(point[0]-other_point[0]) <= md and abs(point[1]-other_point[1]) <= md:
            to_remove.add(other_point)

for point in to_remove:
    points.remove(point)

It works...

>>> points
[(54592748, 54593510), (117628626, 117630648), (1354358, 1619520)]

but I am looking for a faster solution since my list is millions items long.

PyPy helped a lot, it sped the whole process up 6 times, but there is probably a more efficient way to do this in the first place, isn't there?

Any help is very welcome.

=======

UPDATE

I have tested some of the answers with the points object you can pickle.load() from here https://mega.nz/#!TVci1KDS!tE5fTnjpPwbvpFTmW1TLsVXDvYHbRF8F7g10KGdOPCs

My code takes 1104 seconds and reduces the list to 96428 points (from 99920). David's code does the job in 14 seconds! But it misses something, 96431 points left. Martin's code takes 0.06 seconds!! But it also misses something, 96462 points left.

Any clue about why the results are not the same?

Upvotes: 2

Views: 624

Answers (3)

Martin Evans

Reputation: 46759

Depending on how accurate you need this to be, the following approach should work well:

points = [(54592748, 54593510), (54592745, 54593512), (117628626, 117630648), (1354358, 1619520), (54592746, 54593509)]
d = 20
hpoints = {((x - (x % d)), (y - (y % d))) : (x,y) for x, y in points}

for x in hpoints.values():
    print(x)

This converts each point into a dictionary key with each x and y coordinate rounded down to the nearest multiple of d (by subtracting its remainder modulo d). The result is a dictionary holding the coordinates of the last point seen in a given area. For the data you have given, this would display the following:

(117628626, 117630648)
(54592746, 54593509)
(1354358, 1619520)
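
As a quick check of the bucketing, with d = 20 the three near-duplicate sample points all reduce to the same key, which is why only the last of them survives (a small illustrative sketch of the key expression used above):

d = 20
for x, y in [(54592748, 54593510), (54592745, 54593512), (54592746, 54593509)]:
    print((x - (x % d), y - (y % d)))  # each prints (54592740, 54593500)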

Upvotes: 2

NonlinearFruit

Reputation: 2579

This function for getting unique items from a list (it isn't mine, I found it a while back) only loops over the list once (plus dictionary lookups).

def unique(seq, idfun=None):
    # order preserving
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        # in old Python versions:
        # if seen.has_key(marker)
        # but in new ones:
        if marker in seen: continue
        seen[marker] = 1
        result.append(item)
    return result

The id function will require some cleverness. point[0] is divided by error and floored to an integer, so all point[0] values such that x*error <= point[0] < (x+1)*error map to the same x, and similarly for point[1].

def id(point):
    error = 4
    x = point[0]//error
    y = point[1]//error
    idValue = str(x)+"//"+str(y)
    return idValue
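
Putting the two pieces together would look something like this (a minimal usage sketch, assuming points is the list from the question; note that this id shadows Python's built-in id):

filtered = unique(points, idfun=id)
print(filtered)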

So these functions will reduce points that lie between the same consecutive multiples of error to the same point. The good news is that it only touches the original list once, plus the dictionary lookups. The bad news is that this id function won't catch that, for example, 15 and 17 should be considered the same, because 15 reduces to 3 and 17 reduces to 4. It is possible that, with some cleverness, this issue could be resolved.

[NOTE: I originally used exponents of primes for the idValue, but the exponents would be way too large. If you could make the idValue an int, that would increase lookup speed.]

Upvotes: 0

David Moodie

Reputation: 56

Sorting the list first avoids the inner for loop and thus the n^2 time. I'm not sure if it's practically any quicker though, since I don't have your full data. Try this (it outputs the same as far as I can see from your example points, just ordered).

points = [(54592748,54593510),(54592745,54593512),(117628626,117630648),(1354358,1619520),(54592746,54593509)]
md = 10  # max distance allowed between two points
points.sort()
to_remove = set()  # keep track of items to be removed

for i, point in enumerate(points):
    if i == len(points) - 1:
        break
    other_point = points[i+1]
    if abs(point[0]-other_point[0]) <= md and abs(point[1]-other_point[1]) <= md:
        to_remove.add(point)

for point in to_remove:
    points.remove(point)

print(points)
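
With the example points this prints the same three points as the question's output, just in sorted order:

[(1354358, 1619520), (54592748, 54593510), (117628626, 117630648)]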

Upvotes: 2
