Jared
Jared

Reputation: 75

why does unordered set mix the values

I am trying to remove duplicates from an vector by using an unordered_set. but my design creates an unordered_set that doesn't maintain the ordering correctly. in this example the "z" is not at the end. What am I doing wrong? thank you in advance.

EDIT: sorry if I wasn't clear with what I was looking for. I want the output to be "e,d,a,b,c,z" I want to keep original ordering but remove duplicates. I currently have it working using about 3 different for loops and an extra copy of the init vector. I was just looking for a STL function that was cleaner if possible.

output produced: e d a b c a a a a b b b b c z printing unordered set e d a z b c

#include <iostream> 
#include <iterator>     
#include <algorithm>    
#include <string>
#include <unordered_set>
using namespace std;

int main() {
    vector<string>terminals = { "e", "d", "a", "b", "c", "a", "a", "a", "a", "b","b", "b", "b", "c", "z" };
    for (vector<string>::iterator it = terminals.begin(); it != terminals.end(); it++) // print given vector
        cout << *it << " ";
    cout << endl;
    unordered_set<string> newSet;
    copy(terminals.begin(), terminals.end(), inserter(newSet, newSet.end()));
    cout << "printing unordered set" << endl;
    for (unordered_set<string>::iterator it = newSet.begin(); it != newSet.end(); it++)
        cout << *it << " ";
    cout << endl;
    //system("pause");
    return 0;
}

Upvotes: 3

Views: 442

Answers (5)

Ted Lyngmo
Ted Lyngmo

Reputation: 117288

std::unordered_set:

Internally, the elements are not sorted in any particular order, but organized into buckets. Which bucket an element is placed into depends entirely on the hash of its value. This allows fast access to individual elements, since once a hash is computed, it refers to the exact bucket the element is placed into.

If you need the unique terminals to be ordered, use a std::set:

#include <iostream>
#include <vector>
#include <string>
#include <set>

int main() {
    std::vector<std::string>terminals = { "e", "d", "a", "b", "c", "a", "a", "a", "a", "b","b", "b", "b", "c", "z" };

    for(const std::string& terminal : terminals) // print given vector
        std::cout << terminal << " ";
    std::cout << "\n";;

    // populate the set directly from the vectors iterators:
    std::set<std::string> newSet(terminals.begin(), terminals.end());;

    std::cout << "printing the (ordered) set:" << "\n";;
    for(const std::string& terminal : newSet)
        std::cout << terminal << " ";
    std::cout << "\n";;
}

If you want to maintain the original order, you can't use either set as your final storage, but you can use a std::unordered_set as a cache/blacklist for values you've already inserted in your final storage.

#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <unordered_set>

int main() {
    std::vector<std::string>terminals = { "e", "d", "a", "b", "c", "a", "a", "a", "a", "b","b", "b", "b", "c", "z" };

    for(const std::string& terminal : terminals) // print given vector
        std::cout << terminal << " ";
    std::cout << "\n";;

    std::vector<std::string> newSet; // not really a set anymore
    std::unordered_set<std::string> cache; // blacklist

    // try to insert all terminals and only when an insert is successful,
    // put the terminal in newSet

    std::for_each(terminals.begin(), terminals.end(),
        [&](const std::string& terminal) {
            auto [it, inserted] = cache.insert(terminal);
            if(inserted)
                newSet.push_back(terminal);
        }
    );

    std::cout << "printing the vector of unique terminals:" << "\n";;
    for(const std::string& terminal : newSet)
        std::cout << terminal << " ";
    std::cout << "\n";;
}

If you want the original order and don't mind making changes directly to the original terminals vector, you can use std::remove_if combined with an unordered_set which is nice since it doesn't require a new vector. This is an annotated variant of @Marek R's answer:

Read this first: Erase–remove idiom

int main() {
    std::vector<std::string>terminals = { "e", "d", "a", "b", "c", "a", "a", "a", "a", "b","b", "b", "b", "c", "z" };

    for(const std::string& terminal : terminals) // print given vector
        std::cout << terminal << " ";
    std::cout << "\n";;

    std::unordered_set<std::string> cache; // blacklist

    // remove_if() moves all entries in your container, for which the
    // UnaryPredicate(*) returns true, to the end of the container. It returns
    // an iterator pointing to the first element in the vector that was
    // moved - which is a suitable starting point for a subsequent erase().
    //
    // (*) UnaryPredicate: A callable that returns true or false given a single
    //                     value.

    // auto past_new_end = std::vector<std::string>::iterator past_new_end
    auto past_new_end = std::remove_if(terminals.begin(), terminals.end(),
        // this lambda is the UnaryPredicate
        [&](const std::string& terminal) {
            // insert returns a std::pair<Iterator, bool>
            // where the bool (.second in the pair) is false
            // if the value was not inserted (=it was already present)
            return cache.insert(terminal).second == false;
        }
    );

    std::cout << "display all the entries (now with unspecified values) "
                 "that will be erased:\n";
    std::copy(past_new_end, terminals.end(),
                            std::ostream_iterator<std::string>(std::cout, "<"));
    std::cout << "\n";

    // erase all the moved entries
    terminals.erase(past_new_end, terminals.end());

    std::cout << "printing the unique terminals:" << "\n";;
    for(const std::string& terminal : terminals)
        std::cout << terminal << " ";
    std::cout << "\n";;
}

Upvotes: 5

Marek R
Marek R

Reputation: 37607

I am trying to remove duplicates from an vector by using an unordered_set.

Why do you assumed that unordered_set conserves any kind of order? Name clearly states that there is no any particular order.

You should use unordered_set only to keep track if item was already found in sequence. Based on that you can remove item from sequence, so this should go like this:

void removeDuplicates(Data &data)
{
    std::unordered_set<std::string> foundItems;
    auto newEnd = std::remove_if(data.begin(), data.end(), [&foundItems](const auto &s)
                                 {
                                     return !foundItems.insert(s).second;
                                 });
    data.erase(newEnd, data.end());
}

https://wandbox.org/permlink/T24UfiLQep0XUQhQ

Upvotes: 0

Mwak
Mwak

Reputation: 104

Looks like you want to use a (ordered) set.

Edit : Looks like you don't, actually. A std::vector could work, but it's probably not the cleanest workaround.

Upvotes: 2

bw0248
bw0248

Reputation: 711

You could also use an unordered map and then store the item as as key to the map and the index as the corresponding value to that key.

Upvotes: 0

Jerry Coffin
Jerry Coffin

Reputation: 490108

If you want to maintain the original order, but enforce uniqueness, you probably want to:

  1. read in an item.
  2. Try to insert it into the set
  3. if that succeeds, it wasn't previously in the set, so also copy it to the output
  4. Repeat

If you want the outputs ordered (so, in you example, the output would be "a b c d e z"), then you could either insert the items in an std::set, or else you could use std::sort followed by std::unique to get exactly one of each unique element in the input.

Upvotes: 2

Related Questions