Binary search a list of integers that are encoded with arbitary number of bytes

Question

A list of sorted integers are encoded in a uint8_t vector. The amount of bytes to encode these integers is arbitrary.

For example, if we use 3 bytes to encode each integer, then the first integer is encoded in the first 3 bytes, the second integer is encoded in the second 3 bytes and so on.

I would like to binary search an integer from this uint8_t vector.

My attempt so far: std::bsearch seems what I want. In which, I can specify the size of each element in the array in bytes. However, to implement the comparator function, I need access to a variable that stores the size of each element in bytes. std::bsearch is a C function, so the compiler does not allow me to access a class member variable that stores the size of each element in bytes. Global variable works, but it is ugly to store it as a global variable...

std::binary_search takes an iterator. I guess I need iterator to skip certain number of bytes. I am not sure how to make this happen though.

Bitwize · Accepted Answer

The first thing that comes to mind for this is to define a custom iterator that wraps another iterator and operates on a sequence of N bytes at a time. This would allow the wrapped iterator to be easily used in std::binary_search without being stuck rewriting it.

Basically, the utility would have the following:

The iterator should be random access,
Each increment (operator++) operates on N bytes at a time,
Each decrement (operator--) operates on N bytes at a time,
Each addition (operator+) advances by N * i bytes at a time,
Each subtraction (operator-) reports a distance of d / N (to account for the N bytes, and
The value type of the iterator will be some large enough int to easily hold the arbitrary value lengths (e.g. std::int64_t or something)

Something like:

// Iterator that skips N entries over a byte type
template 
class byte_iterator
{
public:
    // Ensure that the underlying type is 'char' (or some other 'byte' type, like std::byte)
    static_assert(std::is_same::value, "");
 
    using value_type = std::uint64_t;
    using reference = std::uint64_t;
    using pointer = value_type*;
    using size_type = std::size_t;
    using difference_type = std::ptrdiff_t;
    using iterator_category = std::random_access_iterator_tag;

    byte_iterator() = default;

    explicit byte_iterator(Iterator underlying)
      : m_it{underlying}
    {
    }

    byte_iterator(const byte_iterator&) = default;
    byte_iterator& operator=(const byte_iterator&) = default;

    byte_iterator& operator++() {
        std::advance(m_it, N);
        return (*this);
    }

    byte_iterator operator++(int) {
        auto copy = (*this);
        std::advance(m_it, N);
        return copy;
    }

    byte_iterator& operator--() {
        std::advance(m_it, -static_cast(N));
        return (*this);
    }

    byte_iterator operator--(int) {
        auto copy = (*this);
        std::advance(m_it, -static_cast(N));
        return copy;
    }

    byte_iterator& operator+=(std::ptrdiff_t n) {
        std::advance(m_it, static_cast(N) * n);
        return (*this);
    }

    byte_iterator& operator-=(std::ptrdiff_t n) {
        std::advance(m_it, -static_cast(N) * n);
        return (*this);
    }

    value_type operator*() const {
        // build your int here
        // The question doesn't indicate endianness, so this is up to you
        // this can just be a bunch of shifts / masks to create the int
    }

    bool operator==(const byte_iterator& other) const {
       return m_it == other.m_it;
    }

    bool operator!=(const byte_iterator& other) const {
       return m_it != other.m_it;
    }

    Iterator underlying() const { return m_it; }

private:
    Iterator m_it;
};

template 
std::ptrdiff_t operator-(const byte_iterator& lhs, const byte_iterator& rhs) {
    return (lhs.underlying() - rhs.underlying()) / 3;
}

template 
byte_iterator operator+(const byte_iterator& lhs, std::ptrdiff_t rhs) {
    return byte_iterator{lhs} += rhs;
};

template 
byte_iterator operator-(const byte_iterator& lhs, std::ptrdiff_t rhs) {
    return byte_iterator{lhs} -= rhs;
};


// other necessary operators...

template 
byte_iterator> make_byte_iterator(Iterator it) {
    return byte_iterator>{it};
}

Note: N doesn't have to be fixed at compile time as a template argument, it could also be done at runtime as a constructor argument. Similarly Iterator doesn't have to be a templated type; I simply use this as an example so that it may work with any underlying iterator. Realistically you could simply make this a char* or something so that it only works on contiguous byte arrays.

All that needs to be implemented is the reconstruction of the int for the purposes of comparison, which can be done from a series of 8-bit left shifts and masks depending on the endianness.

With an iterator wrapper like this, we can use std::lower_bound for a simple binary search to find the iterator:

// Assuming 3-byte numbers:

const auto value = 42;
auto byte_numbers = std::vector{...};

// ensure we have an increment of 3 bytes, otherwise the 
// iterators will break
assert(byte_numbers.size() % 3 == 0);

auto result = std::lower_bound(
    make_byte_iterator<3>(byte_numbers.begin()),
    make_byte_iterator<3>(byte_numbers.end()), 
    value
);

The result iterator will by a byte_iterator that contains an underlying iterator to the start of the N-byte number you need to retrieve. If you need the underlying iterator, you can just call result.underlying()

Alternatively, this can be used in std::binary_search to detect existence of the element in general without finding the underlying iterator (as mentioned in the question).

Note: The above code is untested -- it may have a typo or two, but the process should work as described.

Edit: Also be aware that this will only work correctly if the underlying iterator range is a multiple of N! If it were not, then you will report the incorrect range, which will also cause advancing the iterator to potentially move past the end iterator, which would be undefined behavior. It's important that you check the boundaries first before wrapping an iterator in this way.

Edit 2: If the N-byte integers can exceed a large value like int64_t, but follow a simple heuristic, you can also make the value_type of the iterator be a custom arbitrary_int type that has a well-defined comparison operator. For example:

template 
class arbitrary_int {
public:
    arbitrary_int(const unsigned char* integer)
      : m_int{integer}
    {
    }

    // assuming we can lexicographically compare all bytes
    bool operator==(const arbitrary_int& rhs) const {
        return std::equal(m_int, m_int + N, rhs.m_int, rhs.m_int + N);
    }
    bool operator!=(const arbitrary_int& rhs) const {
        return !(*this == rhs);
    }
    bool operator<(const arbitrary_int& rhs) const {
        return std::lexicographical_compare(m_int, m_int + N, rhs.m_int, rhs.m_int + N);
    }
    bool operator>(const arbitrary_int& rhs) const {
        return rhs < *this;
    }
    bool operator<=(const arbitrary_int& rhs) const {
        return !(rhs < *this);
    }
    bool operator>=(const arbitrary_int& rhs) const {
        return !(*this < *rhs);
    }

private:
    const unsigned char* m_int;
};

In this case, byte_iterator::operator*() would now return

return arbitrary_int{m_it.operator->()}

Or if you're using c++20,

return std::to_address(m_it);

Binary search a list of integers that are encoded with arbitary number of bytes

Answers (2)

Related Questions