Jim Garrison
Jim Garrison

Reputation: 4276

Is there a generalization of std::bitset for two-bit values?

Suppose I am a genome scientist trying to store extremely long strings of characters, each of which represents two bits of information (i.e. each element is either G, A, T, or C). Because the strings are incredibly long, I need to be able to store a string of length N in precisely 2N bits (or rather, N/4 bytes).

With that motivation in mind, I am looking for a generalization of std::bitset (or boost::dynamic_bitset<>) that works on two-bit values instead of single-bit values. I want to store N such two-bit values, each of which can be 0, 1, 2, or 3. I need the data packed as closely as possible in memory, so vector<char> will not work (as it wastes a factor of 4 of memory).

What is the best way to achieve my goal? One option is to wrap the existing bitset templates with customized operator[], iterators, etc., but I'd prefer to use an existing library if at all possible.

Upvotes: 4

Views: 943

Answers (3)

Keith
Keith

Reputation: 6834

std::bitset<> is fixed length and you probably do not want that.

I think you should go ahead and wrap std::vector<bool>.

Note that std::vector<bool> is optimised for space, but has the benefit that it is dynamic in size. Presumably you need to read the genome of arbitrary length on from somewhere.

Have a think about whether you need much of an API to access it; you might only need a couple of methods.

@Jefffrey's answer already covers the relevant code, if for bitset<>.

[ I am not familiar with boost::dynamic_bitset<> and what it might give over vector.]

One further thought is whether it might be convenient for you to work with quads of letters, a quad nicely filling a char in space.

class Genome
{
public:
    enum class Letter {A,C,G,T};
    Genome(const std::string& source)
    {
        code_.resize(source.size() * 2);
        for (unsigned index = 0; index != source.size(); ++index)
        {
            char text = source[index];
            Letter letter = textToLetter(text);
            set(index, letter);
        }
    }  
    static Letter textToLetter(char text)
    {
        // Or search through the array `letterText`.
        // Or come up with a neat but unintelligible one liner ...
        Letter letter = Letter::A;
        switch (text)
        {
        case 'A':
            letter = Letter::A;
            break;
        case 'C':
            letter = Letter::C;
            break;
        case 'G':
            letter = Letter::G;
            break;
        case 'T':
            letter = Letter::T;
            break;
        default:
            // Invalid - handle error.
            break;
        }
        return letter;
    }
    static char letterToText(Letter l) 
    {
        return letterText[(unsigned)l];
    }
    // Add bounds checking
    Letter get(unsigned index) const
    {
        unsigned distance = index * 2;
        char numeric = code_[distance] + code_[distance + 1] * 2;
        return Letter(numeric);
    }
    // Add bounds checking
    void set(unsigned index, Letter value)
    {
        unsigned distance = index * 2;
        bool low = (unsigned)value & 1;
        bool high = (bool)((unsigned)value & 2);
        code_[distance] = low;
        code_[distance + 1]  = high;
    }
    unsigned size()
    {
        return code_.size() / 2;
    }
    // Extend by numLetters, initially set to 'A'
    void extend(unsigned numLetters)
    {
        code_.resize(code_.size() + numLetters * 2);
    }
private:

    static char letterText[4];
    std::vector<bool> code_;
};

char Genome::letterText [4] = { 'A', 'C', 'G', 'T' };

int main()
{
    Genome g("GATT");
    g.extend(3);
    g.set(5, Genome::Letter::C);
    for (unsigned i = 0; i != g.size(); ++i)
        std::cout << Genome::letterToText(g.get(i));
    std::cout << std::endl;
    return 0;
}

Upvotes: 2

Pseudonym
Pseudonym

Reputation: 2696

Here's what I use for fixed-length k-mers.

#include <cstdint>
#include <cstdlib>
#include <ostream>

enum class nucleotide { A, C, G, T };

inline std::ostream&
operator<<(std::ostream& pOut, nucleotide pNt)
{
    switch (pNt) {
        case nucleotide::A: pOut << 'A'; break;
        case nucleotide::C: pOut << 'C'; break;
        case nucleotide::G: pOut << 'G'; break;
        case nucleotide::T: pOut << 'T'; break;
    }
    return pOut;
}

class kmer_base;

class nucleotide_proxy {
public:
    operator nucleotide() const {
        return nucleotide((*mWord >> (mPosition * 2)) & 3);
    };

    nucleotide_proxy& operator=(nucleotide pNt) {
        uint64_t word = *mWord;
        word &= ~(uint64_t(3) << (mPosition*2));
        word |= uint64_t(pNt) << (mPosition*2);
        *mWord = word;

        return *this;
    };

private:
    friend class kmer_base;

    nucleotide_proxy(uint64_t* pWord, uint8_t pPosition)
        : mWord(pWord), mPosition(pPosition)
    {
    }

    uint64_t* mWord;
    uint8_t mPosition;
};


class kmer_base {
protected:
    nucleotide_proxy access(uint64_t* pWord, size_t pPosition)
    {
        return nucleotide_proxy(pWord + (pPosition / 32), (pPosition & 31));
    }

    const nucleotide_proxy access(uint64_t* pWord, size_t pPosition) const
    {
        return nucleotide_proxy(pWord + (pPosition / 32), (pPosition & 31));
    }
};


template<int K>
class kmer : public kmer_base
{
    enum { Words = (K + 31) / 32 };
public:
    nucleotide_proxy operator[](size_t pOutdex) {
        return access(mWords, pOutdex);
    }

    const nucleotide_proxy operator[](size_t pOutdex) const {
        return access(mWords, pOutdex);
    }

private:
    uint64_t mWords[Words];
};

Extending this to dynamic-length k-mere is left as an exercise; it's pretty easy once you have nucleotide_proxy at your disposal. Implementing the reverse complement operator efficiently is also left as an exercise.

Upvotes: 0

Shoe
Shoe

Reputation: 76240

You have two choices.

Given:

enum class nucleobase { a, c, g, t };

You have two choices. You can:

  • use a single std::bitset and play with indexing
  • use std::bitset in combination with another container

For the first, you can just define a couple of functions that target the correct number of bits per set/get:

template<std::size_t N>
void set(std::bitset<N>& bits, std::size_t i, nucleobase x) {
    switch (x) {
        case nucleobase::a: bits.set(i * 2, 0); bits.set(i * 2 + 1, 0); break;
        case nucleobase::c: bits.set(i * 2, 0); bits.set(i * 2 + 1, 1); break;
        case nucleobase::g: bits.set(i * 2, 1); bits.set(i * 2 + 1, 0); break;
        case nucleobase::t: bits.set(i * 2, 1); bits.set(i * 2 + 1, 1); break;
    }
}

template<std::size_t N>
nucleobase get(const std::bitset<N>& bits, std::size_t i) {
    if (!bits[i * 2])
        if (!bits[i * 2 + 1]) return nucleobase::a;
        else                  return nucleobase::c;
    else
        if (!bits[i * 2 + 1]) return nucleobase::g;
        else                  return nucleobase::t;
}

Live demo

The above is just an example and a terrible one (it's almost 4AM here and I really need to sleep).

For the second you just need to map alleles and bits:

bit_pair bits_for(nucleobase x) {
    switch (x) {
        case nucleobase::a: return bit_pair("00"); break;
        case nucleobase::c: return bit_pair("10"); break;
        case nucleobase::g: return bit_pair("01"); break;
        case nucleobase::t: return bit_pair("11"); break;
    }
}

nucleobase nucleobase_for(bit_pair x) {
    switch (x.to_ulong()) {
        case 0: return nucleobase::a; break;
        case 1: return nucleobase::c; break;
        case 2: return nucleobase::g; break;
        case 3: return nucleobase::t; break;
        default: return nucleobase::a; break; // just for the warning
    }
}

Live demo

Of course if you need runtime length you can just use boost::dynamic_bitset and std::vector.

Upvotes: 1

Related Questions