Reputation: 4276
Suppose I am a genome scientist trying to store extremely long strings of characters, each of which represents two bits of information (i.e. each element is either G, A, T, or C). Because the strings are incredibly long, I need to be able to store a string of length N in precisely 2N bits (or rather, N/4 bytes).
With that motivation in mind, I am looking for a generalization of std::bitset
(or boost::dynamic_bitset<>
) that works on two-bit values instead of single-bit values. I want to store N
such two-bit values, each of which can be 0, 1, 2, or 3. I need the data packed as closely as possible in memory, so vector<char>
will not work (as it wastes a factor of 4 of memory).
What is the best way to achieve my goal? One option is to wrap the existing bitset templates with customized operator[]
, iterators, etc., but I'd prefer to use an existing library if at all possible.
Upvotes: 4
Views: 943
Reputation: 6834
std::bitset<>
is fixed length and you probably do not want that.
I think you should go ahead and wrap std::vector<bool>
.
Note that std::vector<bool>
is optimised for space, but has the benefit that it is dynamic in size.
Presumably you need to read the genome of arbitrary length on from somewhere.
Have a think about whether you need much of an API to access it; you might only need a couple of methods.
@Jefffrey's answer already covers the relevant code, if for bitset<>
.
[ I am not familiar with boost::dynamic_bitset<>
and what it might give over vector
.]
One further thought is whether it might be convenient for you to work with quads of letters, a quad nicely filling a char in space.
class Genome
{
public:
enum class Letter {A,C,G,T};
Genome(const std::string& source)
{
code_.resize(source.size() * 2);
for (unsigned index = 0; index != source.size(); ++index)
{
char text = source[index];
Letter letter = textToLetter(text);
set(index, letter);
}
}
static Letter textToLetter(char text)
{
// Or search through the array `letterText`.
// Or come up with a neat but unintelligible one liner ...
Letter letter = Letter::A;
switch (text)
{
case 'A':
letter = Letter::A;
break;
case 'C':
letter = Letter::C;
break;
case 'G':
letter = Letter::G;
break;
case 'T':
letter = Letter::T;
break;
default:
// Invalid - handle error.
break;
}
return letter;
}
static char letterToText(Letter l)
{
return letterText[(unsigned)l];
}
// Add bounds checking
Letter get(unsigned index) const
{
unsigned distance = index * 2;
char numeric = code_[distance] + code_[distance + 1] * 2;
return Letter(numeric);
}
// Add bounds checking
void set(unsigned index, Letter value)
{
unsigned distance = index * 2;
bool low = (unsigned)value & 1;
bool high = (bool)((unsigned)value & 2);
code_[distance] = low;
code_[distance + 1] = high;
}
unsigned size()
{
return code_.size() / 2;
}
// Extend by numLetters, initially set to 'A'
void extend(unsigned numLetters)
{
code_.resize(code_.size() + numLetters * 2);
}
private:
static char letterText[4];
std::vector<bool> code_;
};
char Genome::letterText [4] = { 'A', 'C', 'G', 'T' };
int main()
{
Genome g("GATT");
g.extend(3);
g.set(5, Genome::Letter::C);
for (unsigned i = 0; i != g.size(); ++i)
std::cout << Genome::letterToText(g.get(i));
std::cout << std::endl;
return 0;
}
Upvotes: 2
Reputation: 2696
Here's what I use for fixed-length k-mers.
#include <cstdint>
#include <cstdlib>
#include <ostream>
enum class nucleotide { A, C, G, T };
inline std::ostream&
operator<<(std::ostream& pOut, nucleotide pNt)
{
switch (pNt) {
case nucleotide::A: pOut << 'A'; break;
case nucleotide::C: pOut << 'C'; break;
case nucleotide::G: pOut << 'G'; break;
case nucleotide::T: pOut << 'T'; break;
}
return pOut;
}
class kmer_base;
class nucleotide_proxy {
public:
operator nucleotide() const {
return nucleotide((*mWord >> (mPosition * 2)) & 3);
};
nucleotide_proxy& operator=(nucleotide pNt) {
uint64_t word = *mWord;
word &= ~(uint64_t(3) << (mPosition*2));
word |= uint64_t(pNt) << (mPosition*2);
*mWord = word;
return *this;
};
private:
friend class kmer_base;
nucleotide_proxy(uint64_t* pWord, uint8_t pPosition)
: mWord(pWord), mPosition(pPosition)
{
}
uint64_t* mWord;
uint8_t mPosition;
};
class kmer_base {
protected:
nucleotide_proxy access(uint64_t* pWord, size_t pPosition)
{
return nucleotide_proxy(pWord + (pPosition / 32), (pPosition & 31));
}
const nucleotide_proxy access(uint64_t* pWord, size_t pPosition) const
{
return nucleotide_proxy(pWord + (pPosition / 32), (pPosition & 31));
}
};
template<int K>
class kmer : public kmer_base
{
enum { Words = (K + 31) / 32 };
public:
nucleotide_proxy operator[](size_t pOutdex) {
return access(mWords, pOutdex);
}
const nucleotide_proxy operator[](size_t pOutdex) const {
return access(mWords, pOutdex);
}
private:
uint64_t mWords[Words];
};
Extending this to dynamic-length k-mere is left as an exercise; it's pretty easy once you have nucleotide_proxy
at your disposal. Implementing the reverse complement operator efficiently is also left as an exercise.
Upvotes: 0
Reputation: 76240
You have two choices.
Given:
enum class nucleobase { a, c, g, t };
You have two choices. You can:
std::bitset
and play with indexingstd::bitset
in combination with another containerFor the first, you can just define a couple of functions that target the correct number of bits per set/get:
template<std::size_t N>
void set(std::bitset<N>& bits, std::size_t i, nucleobase x) {
switch (x) {
case nucleobase::a: bits.set(i * 2, 0); bits.set(i * 2 + 1, 0); break;
case nucleobase::c: bits.set(i * 2, 0); bits.set(i * 2 + 1, 1); break;
case nucleobase::g: bits.set(i * 2, 1); bits.set(i * 2 + 1, 0); break;
case nucleobase::t: bits.set(i * 2, 1); bits.set(i * 2 + 1, 1); break;
}
}
template<std::size_t N>
nucleobase get(const std::bitset<N>& bits, std::size_t i) {
if (!bits[i * 2])
if (!bits[i * 2 + 1]) return nucleobase::a;
else return nucleobase::c;
else
if (!bits[i * 2 + 1]) return nucleobase::g;
else return nucleobase::t;
}
The above is just an example and a terrible one (it's almost 4AM here and I really need to sleep).
For the second you just need to map alleles and bits:
bit_pair bits_for(nucleobase x) {
switch (x) {
case nucleobase::a: return bit_pair("00"); break;
case nucleobase::c: return bit_pair("10"); break;
case nucleobase::g: return bit_pair("01"); break;
case nucleobase::t: return bit_pair("11"); break;
}
}
nucleobase nucleobase_for(bit_pair x) {
switch (x.to_ulong()) {
case 0: return nucleobase::a; break;
case 1: return nucleobase::c; break;
case 2: return nucleobase::g; break;
case 3: return nucleobase::t; break;
default: return nucleobase::a; break; // just for the warning
}
}
Of course if you need runtime length you can just use boost::dynamic_bitset
and std::vector
.
Upvotes: 1