Reputation: 101

How can I remove similar but not duplicate items from a list?

I have a list:

values = [[6.23234121,6.23246575],[1.352672,1.352689],[6.3245,123.35323,2.3]]

What is a way I can go through this list and remove all items that are within say 0.01 to other elements in the same list.

I know how to do it for a specific set of lists using del, but I want it to be general for if values has n lists in it and each list has n elements.

What I want to happen is perform some operation on this list

values = [[6.23234121,6.23246575],[1.352672,1.352689],[6.3245,123.35323,2.3]]

and get this output

new_values = [[6.23234121],[1.352672],[6.3245,123.35323,2.3]]

Upvotes: 0

Answers (1)

Danica

Reputation: 28846

I'm going to write a function to do this for a single list, eg

>>> compact([6.23234121,6.23246575], tol=.01)
[6.23234121]

You can then get it to work on your nested structure through just [compact(l) for l in lst].

Each of these methods will keep the first element that doesn't have anything closer to it in the list; for @DSM's example of [0, 0.005, 0.01, 0.015, 0.02] they'd all return [0, 0.0.15] (or, if you switch > to >=, [0, 0.01, 0.02]). If you want something different, you'll have to define exactly what it is more carefully.

First, the easy approach, similar to David's answer. This is O(n^2):

def compact(lst, tol):
    new = []
    for el in lst:
        if all(abs(el - x) > tol for x in new):
            new.append(el)
    return compact

On three-element lists, that's perfectly nice. If you want to do it on three million-element lists, though, that's not going to cut it. Let's try something different:

import collections
import math

def compact(lst, tol):
    round_digits = -math.log10(tol) - 1
    seen = collections.defaultdict(set)
    new = []
    for el in lst:
        rounded = round(seen, round_digits)
        if all(abs(el - x) > tol for x in seen[rounded]):
            seen[rounded].add(el)
            new.append(el)
    return new

If your tol is 0.01, then round_digits is 1. So 6.23234121 is indexed in seen as just 6.2. When we then see 6.23246575, we round it to 6.2 and look that up in the index, which should contain all numbers that could possibly be within tol of the number we're looking up. Then we still have to check distances to those numbers, but only on the very few numbers that are in that index bin, instead of the entire list.

This approach is O(n k), where k is the average number of elements that'll fall within one such bin. It'll only be helpful if k << n (as it typically would be, but that depends on the distribution of the numbers you're using relative to tol). Note that it also uses probably more than twice as much memory as the other approach, which could be an issue for very large lists.

Another option would be to sort the list first; then you only have to look at the previous and following elements to check for a conflict.

Upvotes: 1

How can I remove similar but not duplicate items from a list?

Answers (1)

Related Questions