Erel
Erel

Reputation: 749

How to remove duplicates in a vector based on an equality predicate?

I have a struct which roughly looks like that:

struct vec3
{
    int x;
    int y;
    int z;

    constexpr bool operator==(const vec3& v) const
    {
        return (x == v.x) && (y == v.y) && (z == v.z);
    }

    constexpr vec3 operator-() const
    {
        return {-x, -y, -z};
    }
};

I then generate a std::vector of vec3 with random values for each coordinates. The function in which it is used requires that no couple of values {v1, v2} in that vector verifies v1 == -v2. I obviously need that definition of operator== (the one in the snippet above) elsewhere in code, otherwise that problem would be trivial.

I first attempted std::set and std::sort + std::unique, but could not find any way to have a struct filing named requirements Compare for that application (which is needed for both approaches).

How can I proceed?

Note:

This is somewhat different from Removing duplicates from a non-sortable vector in which pointers are sorted, and also from C++ how to remove duplicates from vector of Class type? in which the elements are sortable according to some criteria (I think).

Upvotes: 0

Views: 518

Answers (2)

Erel
Erel

Reputation: 749

I believe the simplest method would be to use std::unordered_set and exploit its second and third template parameters.

Method

  1. Define a hash function

This step goal is a provide a " pre-filtering " step which eliminates obvious duplicates according to the meaning in the context (so here, v1 and -v1 should for example have the same hash).

That's something that should be on a per-class basis . There is no way to come up with a generic performant hashing function, though non-performant critical application may use the first hasher below (but I won't really recommend that anymore).

a. The bad hasher

This is what I originally proposed, before taking comment from @axnsan and @François Andrieux in count.

The simplest hasher I can think of looks like that:

struct bad_hasher
{
    std::size_t operator()(const value_type&) const
    {
        return 0;
    }
};

It makes all hash collide and forces std::unordered_set to use KeyEqual to determine objects equality. So indeed, that works, but that does not make it a good choice. @axnsan and @François Andrieux pointed the following drawbacks in the comment below:

  • "it changes its performance characteristics to O(n^2) (it will have to iterate through all elements on each lookup) "(- @axnsan)
  • "[it converts] the set into a simple unsorted linked list. Every element will collide with every other element, it it looks like typical implementations use collision chaining". (- @François Andrieux)

In other words, this makes std::unordered_set become the same as a std::vector + std::remove_if.

b. The better hasher

@axnsan suggests the following hasher for this particular application:

struct better_hasher
{
    std::size_t operator()(const vec3& v) const
    {
        return static_cast<std::size_t>(std::abs(v.x) + std::abs(v.y) + std::abs(v.z));
    }
};

It fills the following requirements:

  • better_hasher(v) == better_hasher(-v).
  • v1 != v2 => better_hasher(v1) != better_hasher(v2) in most cases ((1, 0, 1) will collide with (1, 1, 0) for example)
  • not all hashes collide.
  • removes obvious duplicates.

Which is probably somewhere near the best we could hope for in this configuration.

We then need to remove those "false positives" due to hash collisions.

  1. Define a key equality predicate

The goal here is to remove duplicates that were not detected by the hasher (here, typically vectors such as (1, 0, 1) / (1, 1, 0) or overflow).

Declare a predicate struct which roughly looks like:

struct key_equal
{
    bool operator()(const value_type& a, const value_type& b) const
    {
        
        return (a == b) || <...>;
    }
};

Where <...> is anything making two values identical in the current context ( so here a == b) || -a == b for example).

Note that this expects operator== to be implemented.

  1. Erase duplicates

Declare an std::unordered_set which removes duplicates:

const std::unordered_set<value_type, hasher, key_equal> s(container.begin(), container.end());
container.assign(s.begin(), s.end());
  1. (alt) Erase duplicates (and conserve original order in container)

Basically the same, but this checks if an element can be inserted in the std::unordered_set, and if does not, remove it. Adapted from @yury's answer in What's the most efficient way to erase duplicates and sort a vector?.

std::unordered_set<value_type, hasher, comparator> s;

const auto predicate = [&s](const value_type& value){return !s.insert(value).second;};

container.erase(std::remove_if(container.begin(), container.end(), predicate), container.end());

Generic (container-independent) templated function:

template<typename key_equal_t, typename hasher_t, bool order_conservative, typename container_t>
void remove_duplicates(container_t& container)
{
    using value_type = typename container_t::value_type;

    if constexpr(order_conservative)
    {
        std::unordered_set<value_type, hasher_t, key_equal_t> s;
        const auto predicate = [&s](const value_type& value){return !s.insert(value).second;};
        container.erase(std::remove_if(container.begin(), container.end(), predicate), container.end());
    }
    else
    {
        const std::unordered_set<value_type, hasher, key_equal_t> s(container.begin(), container.end());
        container.assign(s.begin(), s.end());
    }
}

Expects to be provided key_equal_t and hasher_t (and a bool known a compile time indicating if you care about element being kept in the same order or not). I did not benchmark any of the two branches in this function so I do not know if one is significantly better than another, though this answer seems show manual insertion may be faster.

Example in this use case:

/// "Predicate" indicating if two values are considered duplicates or not
struct key_equal
{
    bool operator()(const vec3& a, const vec3& b) const
    {
        // Remove identical vectors and their opposite
        return (a == b) || (-a == b);
    }
};

/// Hashes a vec3 by adding absolute values of its components.
struct hasher
{
    std::size_t operator()(const vec3& v) const
    {
        return static_cast<std::size_t>(std::abs(v.x) + std::abs(v.y) + std::abs(v.z));
    }
};

remove_duplicates<key_equal, hasher, order_conservative>(vec);

Test data:

vec3 v1{1, 1, 0};   // Keep
vec3 v2{0, 1, 0};   // Keep
vec3 v3{1, -1, 0};  // Keep
vec3 v4{-1, -1, 0}; // Remove
vec3 v5{1, 1, 0};   // Remove

std::vector vec{v1, v2, v3, v4, v5};

Test 1: non-order-conservative:

// ...<print vec values>
remove_duplicates<key_equal, hasher, false>(vec);
// ... <print vec values>

Output (before / after):

(1, 1, 0) (0, 1, 0) (1, -1, 0) (-1, -1, 0) (1, 1, 0)
(1, -1, 0) (0, 1, 0) (1, 1, 0) 

Test 2: order-conservative:

// ... <print vec values>
remove_duplicates<key_equal, hasher, true>(vec);
// ... <print vec values>

Output (before / after):

(1, 1, 0) (0, 1, 0) (1, -1, 0) (-1, -1, 0) (1, 1, 0) 
(1, 1, 0) (0, 1, 0) (1, -1, 0) 

Upvotes: 1

eerorika
eerorika

Reputation: 238311

The function in which it is used requires that no couple of values {v1, v2} in that vector fills v1 == -v2

but could not find any way to have a struct filing named requirements Compare for that application (which is needed for both approaches).

It seems to me that you're trying to solve X, but this is the Y in your XY-problem.

It's fairly simple to implement an ordered comparator that satisfies the equality of -v == v. Simply compare absolute values:

struct vec3
{
    int x;
    int y;
    int z;

    // normal comparison that treats -x != x
    friend auto operator<=>(const vec3&, const vec3&) = default;
};

// declare in same namespace as vec3 for ADL
vec3 abs(const vec3& v) {
    return {std::abs(v.x), std::abs(v.y), std::abs(v.z)};
}


struct abs_less {
    template< class T, class U>
    constexpr auto operator()( T&& lhs, U&& rhs ) const
        -> decltype(std::forward<T>(lhs) < std::forward<U>(rhs))
    {
        using std::abs; // for integers
        return abs(lhs) < abs(rhs); // this implementation assumes normal comparison operator
        // you can implement logic directly here if that's not possible
    }
};

This comparator works with both std::set and std::sort + std::unique. Example with set:

std::set<vec3, abs_less> example;

Of course, you could overload the operator< directly and use std::less, but usually I would recommend against non-defaulted operator overloads that have unusual behaviour.

Upvotes: 1

Related Questions