prestokeys
prestokeys

Reputation: 4839

Generalization of "remove all duplicates"

Let related be a binary predicate. Let us define a "relation subset" as the set of all elements such that any element in the subset is related to at least one other element in the subset via 'related', other than itself (hence whether an element is related to itself or not is irrelevant in forming the relation subsets). Note: a relation subset is NOT necessarily a strongly connected component. For example, suppose A is related to B, and B and C are related to each other. Then {A,B,C} is a relation subset by definition, but is not a strongly connected component because there is no path from B to A or from C to A via 'related'.

Note that 'related' is not necessarily symmetric, i.e. related(a,b) == true does not necessarily mean that related(b,a) == true, nor is it transitive, i.e. related(a,b) == true and related(b,c) == true does not necessarily imply related(a,c) == true. As far as I can see, no restrictions on the binary predicate related need be imposed in order to partition a set into relation subsets. If an element x is not related to any element at all (other than itself), then {x} is itself its own relation subset.

One good problem is to define

template <typename Container, typename BinaryPredicate>
void partition (Container& container, BinaryPredicate related);

that will sort the elements of container so that the container is partitioned into its relation subsets. This question arose quickly after attempting the original problem:

template <typename Container, typename BinaryPredicate>
void removeRelated (Container& container, BinaryPredicate related);

which is to remove (from container) all elements from each relation subset, except for the first one of each subset found in the container.

Of course, equality is just a special case of 'related', and hence this is a generalization of "remove all duplicates" (that is how I thought up this question, by trying to generalize that well-known solved problem). What we want is to keep one representative of each relation subset, namely the first one according to the container's ordering of elements.

Below is the code I tried to implement, with the following relationships :

Bob knows Mary
Mary knows Jim
Jim knows Bob
Sally knows Fred
Fred knows no one.

In this example, A is related to B if and only if A knows B. {Bob, Mary, Jim} is a relation subset because each person is related to someone else (Bob is related to Mary, Mary is related to Jim, Jim is related to Bob). Note that {Sally, Fred} is NOT a relation subset because though Sally is related to Fred, Fred is not related to Sally. Thus we are left with {Sally}, {Fred} as the other two relation subsets.

So the final answer should be : Bob, Sally, Fred. Note that if the function partition were to be defined, then it would simply leave {Bob, Mary, Jim, Sally, Fred} unchanged, with Bob, Sally, Fred being the first element of each partition. Hence even if we had the partition function (not to be confused with std::partition of course, but I'm not really thinking about a good name right now), it is still not immediately clear what removeRelated would need.

#include <iostream>
#include <algorithm>
#include <list>
#include <set>

template<typename Container, typename BinaryPredicate>
void removeRelated (Container& container, BinaryPredicate related) {
    using element = typename Container::value_type;
    std::set<element> s;
    for (typename Container::iterator it = std::begin(container);  it != std::end(container); ) {
        if (std::find_if(s.begin(), s.end(),
                [=](const element& x)->bool {return related(*it,x);}) != s.end())
            it = container.erase(it);  // *it is related to an element in s.
        else {
            s.insert(*it);
            it++;
        }
    }
}

struct Person {  // Used for testing removeRelated.
    std::string name;
    std::list<Person*> peopleKnown;
    Person (const std::string& n) : name(n) {}
    void learnsAbout (Person* p) {peopleKnown.push_back(p);}
    bool knows (Person* p) const {
        return std::find(peopleKnown.begin(), peopleKnown.end(), p) != peopleKnown.end();
    }
};

int main() {
    Person *bob = new Person("Bob"), *mary = new Person("Mary"), *jim = new Person("Jim"),
        *sally = new Person("Sally"), *fred = new Person("Fred");
    bob->learnsAbout(mary);
    mary->learnsAbout(jim);
    jim->learnsAbout(bob);
    sally->learnsAbout(fred);
    std::list<Person*> everybody {bob, mary, jim, sally, fred};
    removeRelated (everybody, [](Person* a, Person* b)->bool {return a->knows(b);});
    for (const Person* x : everybody) std::cout << x->name << ' ';  // Bob Mary Sally Fred
//  Should be 'Bob Sally Fred' since {Bob, Mary, Jim}, {Sally}, {Fred} are the relation subsets.
}

The bug from the code above is that Mary is inserted into 's' because at that point she is not related to anyone in s (through 'knows'). In the end, she is related to Jim, who is related to Bob (and Bob to Mary) through 'knows', and hence {Bob, Mary, Jim} is a relation subset, so bob should be the only one of these three people in 's'.

But during the iteration the function does not know that. How can I fix the algorithm? One idea is to perhaps first define the function partition mentioned above, i.e. to sort the container so that the container is partitioned into its relation subsets (which is a very nice problem in itself, and could very well be the main question at hand), and then simply take the first element of each partition.

Another idea is to replace the lambda function

[=](const element& x)->bool {return related(*it,x);}

with

[=](const element& x)->bool {return relatedIndirectly(*it,x);}

and I think it might solve the problem, where the helper function relatedIndirectly searches for a chain of relations to x.

Here is another example examined by Cimbali (below). Suppose A is related to B, A is related to C, B is related to C. Then {A,B} cannot be a relation subset because B is not related to A. Similarly, {A,C} and {B,C} cannot be relation subsets (C is not related to A, and C is not related to B), and {A,B,C} definitely not since C is not related to anyone. {A}, {B}, {C} is the only partitioning that satisfies my definition of relation subsets.

Conjecture: A relation subset is always the union of strongly connected components such that each strongly connected component in the union has at least one element that is related to some element in a different strongly connected component in the union. But this would need mathematical proof.

Update: I've strengthened the above definition of related subset (greatly) so that: BinaryPredicate 'related' shall be reflexive (related(x,x) == true for any x), symmetric (related(x,y) implies related(y,x)), and transitive (related(x,y) and related(y,z) implies related(x,z)). This partitions any set into equivalence classes. removeRelated shall remove all elements from every equivalence class except for the first element in the container. This generalizes the classic problem of "remove all duplicates", since equality is a special case of an equivalence relation. The following code now gives the correct results, but I wonder if there is a way to weaken the condition of 'related' and still get the same results.

#include <iostream>
#include <algorithm>
#include <list>
#include <set>

template<typename Container, typename BinaryPredicate>
void removeRelated (Container& container, BinaryPredicate related) {
    using element = typename Container::value_type;
    std::set<element> s;
    for (typename Container::iterator it = std::begin(container);  it != std::end(container); ) {
        if (std::find_if(s.begin(), s.end(),
                [=](const element& x)->bool {return related(*it,x);}) != s.end())
            it = container.erase(it);  // *it is related to an element in s.
        else {
            s.insert(*it);
            it++;
        }
    }
}

// Used for testing removeRelated.  Person::isRelativeOf is an equivalence relation.
struct Person {
    std::string name;
    std::list<Person*> relatives;
    Person (const std::string& n) : name(n) {relatives.push_back(this);}  // Forcing reflexivity
    void addRelative (Person* p) {
        for (Person* relatives_of_p : p->relatives)
            relatives.push_back(relatives_of_p);  // Forcing transitivity ('relatives.push_back(p);' included in this)
        p->relatives.push_back(this);  // Forcing symmetry
    }
    bool isRelativeOf (Person* p) const {
        return std::find(relatives.begin(), relatives.end(), p) != relatives.end();
    }
};

int main() {
    Person *bob = new Person("Bob"), *mary = new Person("Mary"), *jim = new Person("Jim"),
        *sally = new Person("Sally"), *fred = new Person("Fred");
    bob->addRelative(mary);  // crashes
    mary->addRelative(jim);
    jim->addRelative(bob);
    sally->addRelative(fred);
    std::list<Person*> everybody {bob, mary, jim, sally, fred};
    removeRelated (everybody, [](Person* a, Person* b)->bool {return a->isRelativeOf(b);});
    for (const Person* x : everybody) std::cout << x->name << ' ';  // Bob Sally (correct)
}

If 'related' is 'knows' or 'is friends of', which need not be symmetric nor transitive, will we still have a partitioning and thus removeRelated can still work?

And another question I wonder about is: what is the fastest sorting algorithm to sort the above so that the equivalence classes consist of consecutive elements? This is what I came up with:

template<typename Container, typename BinaryPredicate>
void sortByEquivalenceClasses (Container& container, BinaryPredicate related) {
    for (auto it = container.begin();  it != container.end();  ++it)
        for (auto jt = std::next(it);  jt != container.end();  ++jt)
            if (related(*it, *jt)) {
                std::iter_swap(jt, std::next(it));
                break;
            }
}

But the sort does not preserve the original relative ordering of the elements. How to preserve that?

Upvotes: 3

Views: 192

Answers (1)

Cimbali
Cimbali

Reputation: 11415

Okay, you define a subset by the following

Let us define a "relation subset" as the set of all elements such that any element in the subset is related to at least one other element in the subset via 'related', other than itself

This sound a bit recursive, but as I understand it

  • A node n without outgoing relations is in the subset that is the singleton {n}
  • A node n that has outgoing relations only to singletons, is in the subset that is the singleton {n}
  • Any cycle (of any length), and more generally any strongly connected component forms a subset, including all their predecessor nodes for the relation.

Example. The following relations :

A -> B
B -> A
C -> B
D -> C
E -> F

Define the following subsets : {A, B, C, D}, {E} and {F}


Assuming the above is correct, we can devise the following algorithm (pseudocode) :

int visit(Node n, int s) { // returns 0 iff there is no cycle anywhere beneath

    if(length(n.relations) = 0 || n.singleton == true)
    // leaves, or only leaves below
    {
        n.singleton = true;
        return false;
    }
    else if(n.visited == true || n.bigger_subset > 0)
    // part of a subcycle, or predecessor of one
    {
        n.bigger_subset = s;
        return true;
    }
    else
    // searching for the nature of what is below
    {
        n.visited = true;
        bool found_cycle = 0;

        for each node m such that n is related to m (n -> m)
            found_cycle = max(Visit(m), found_cycle);

        if( found_cycle > 0 )
            n.bigger_subset = found_cycle;
        else
            n.singleton = true; // parent of only singletons

        n.visited = false;
        return found_cycle;
    }
}

s = length(node_set) + 1; // clearly bigger than the maximal number of subsets
for n in node_set:
{
    if( n.singleton == false && n.big_subcycle == 0 )
    {
        // state unknown, it is thus the first of its subset, unless it is a predecessor (or part) of a known subset
        int set = visit(n, s);

        // not a singleton, and not the current set : a previous one
        if( set > s )
            node_set.remove(n);

        s--;
    }
    else
        node_set.remove(n);
}

What this does is basically a depth-first search from each element on, marking nodes that are being visited, in order to detect cycles. By remembering the state of each node, any predecessor of a subset can be added to the subset without going down to the cycle again.


Here is the C code for this algorithm on the example given above : http://ideone.com/VNumcN

Upvotes: 1

Related Questions