Neil Kirk
Neil Kirk

Reputation: 21763

How to erase a value efficiently from a sorted vector?

Assume that vec is a sorted vector of movable and copyable objects. What is the most efficient way to remove all elements that match value?

Is this correct and the most efficient way?

auto lb = std::lower_bound(vec.begin(), vec.end(), value);
vec.erase(lb, std::upper_bound(std::next(lb), vec.end(), value));

What is the complexity? (Taking into account any moving required after the erasure).

Upvotes: 2

Views: 6668

Answers (3)

Blastfurnace
Blastfurnace

Reputation: 18652

I've done some brief testing with three four different methods of erasing from a sorted container.

void erase_v1(std::vector<int> &vec, int value)
{
    vec.erase(std::remove(std::begin(vec), std::end(vec), value), std::end(vec));
}

void erase_v2(std::vector<int> &vec, int value)
{
    auto lb = std::lower_bound(std::begin(vec), std::end(vec), value);
    if (lb != std::end(vec) && *lb == value) {
        auto ub = std::upper_bound(lb, std::end(vec), value);
        vec.erase(lb, ub);
    }
}

void erase_v3(std::vector<int> &vec, int value)
{
    auto pr = std::equal_range(std::begin(vec), std::end(vec), value);
    vec.erase(pr.first, pr.second);
}

// Surt's code, doesn't preserve sorted order
void erase_v4(std::vector<int> &vec, int value)
{
    // get the range in 2*log2(N), N=vec.size()
    auto bounds = std::equal_range(vec.begin(), vec.end(), value);

    // calculate the index of the first to be deleted O(1)
    auto last = vec.end() - std::distance(bounds.first, bounds.second);

    // swap the 2 ranges O(equals) , equal = std::distance(bounds.first, bounds.last)
    std::swap_ranges(bounds.first, bounds.second, last);

    // erase the victims O(equals)
    vec.erase(last, vec.end());
}

Tested with a std::vector of 10,000,000 elements, filled with random numbers in the range [0..9], and then sorted (MS Visual C++ 2013).

Erase the value 0 (the front of the container), representative times look like:

time=14.3894 size=8999147 // v1, milliseconds and updated container size
time=11.9486 size=8999147 // v2
time=11.5548 size=8999147 // v3
time=1.78913 size=8999147 // v4 (Surt)

Erase 5 (the middle of the container):

time=12.8223 size=9000844
time=4.89388 size=9000844
time=4.87589 size=9000844
time=1.77284 size=9000844

Erase 9 (the end of the container):

time=12.64 size=9000820
time=0.00373372 size=9000820
time=0.00339429 size=9000820
time=1.29899 size=9000820

Erase 13 (value not in container):

time=11.8641 size=10000000
time=0.002376 size=10000000
time=0.00203657 size=10000000
time=0.00220628 size=10000000

The erase/remove method always iterates over the entire container and is slower, the lower_bound/upper_bound and equal_range methods are nearly identical over multiple runs. I prefer the last version because it's correct, simpler code, and less typing.

Edit: Timed Surt's code by request. It is consistently fast at the cost of not preserving sorted order.

Upvotes: 7

Surt
Surt

Reputation: 16089

A solution that leave the vector unsorted after erase.

// get the range in 2*log2(N), N=vec.size()
auto bounds=std::equal_range (vec.begin(), vec.end(), value);  

// calculate the index of the first to be deleted O(1)
auto last = vec.end()-std::distance(bounds.first, bounds.last);

// swap the 2 ranges O(equals) , equal = std::distance(bounds.first, bounds.last)
std::swap_ranges(bounds.first, bounds.last, last);

// erase the victims O(equals)
vec.erase(last, vec.end());

std::remove is O(N) and this solution also does the fewest writes. If equals is near N this might not be so great an idea :)

Upvotes: 6

Barry
Barry

Reputation: 302748

That is not correct in the case that value does not actually appear in vec. So at the very least you will have to do:

auto lb = std::lower_bound(vec.begin(), vec.end(), value);
if (lb != vec.end() && *lb == value) {
    vec.erase(lb, std::upper_bound(std::next(lb), vec.end(), value));
}

As to the question of most efficient: I believe in the general case, knowing nothing about what happens to be in vec, yes. Complexity is still O(N) because erase() is O(N) - you can't really have a non-linear erase if you're erasing like the 2nd element. But in terms of finding the bounds to erase, O(log N) is as good as it gets and you got it.

The question of whether upper_bound() or just find_if() is better for the 2nd part depends completely on how likely you are to have lots of values or not. More likely to have lots, use upper_bound(), more likely to be unique, use find_if().

Upvotes: 3

Related Questions