Encoding DNA strand in Binary

Question

Hey guys I have the following question:

Suppose we are working with strands of DNA, each strand consisting of a sequence of 10 nucleotides. Each nucleotide can be any one of four different types: A, G, T or C. How many bits does it take to encode a DNA strand?

Here is my approach to it and I want to know if that is correct.

We have 10 spots. Each spot can have 4 different symbols. This means we require 4^10 combinations using our binary digits.

4^10 = 1048576.

We will then find the log base 2 of that. What do you guys think of my approach?

Peter Cordes · Accepted Answer

Each nucleotide (aka base-pair) takes two bits (one of four states -> 2 bits of information). 10 base-pairs thus take 20 bits. Reasoning that way is easier than doing the log2(4^10), but gives the same answer.

It would be fewer bits of information if there were any combinations that couldn't appear. e.g. some codons (sequence of three base-pairs) that never appear. But ten independent 2-bit pieces of information sum to 20 bits.

If some sequences appear more frequently than others, and a variable-length representation is viable, then Huffman coding or other compression schemes could save bits most of the time. This might be good in a file-format, but unlikely to be good in-memory when you're working with them.

Densely packing your data into an array of 2bit fields makes it slower to access a single base-pair, but comparing the whole chunk for equality with another chunk is still efficient. (memcmp).

20 bits is unfortunately just slightly too large for a 16bit integer (which computers are good at). Storing in an array of 32bit zero-extended values wastes a lot of space. On hardware with good unaligned support, storing 24bit zero-extended values is ok (do a 32bit load and mask the high 8 bits. Storing is even less convenient though: probably a 16b store and an 8b store, or else load the old value and merge the high 8, then do a 32b store. But that's not atomic.).

This is a similar problem for storing codons (groups of three base-pairs that code for an amino acid): 6 bits of information doesn't fill a byte. Only wasting 2 of every 8 bits isn't that bad, though.

Amino-acid sequences (where you don't care about mutations between different codons that still code for the same AA) have about 20 symbols per position, which means a symbol doesn't quite fit into a 4bit nibble.

I used to work for the phylogenetics research group at Dalhousie, so I've sometimes thought about having a look at DNA-sequence software to see if I could improve on how they internally store sequence data. I never got around to it, though. The real CPU intensive work happens in finding a maximum-likelihood evolutionary tree after you've already calculated a matrix of the evolutionary distance between every pair of input sequences. So actual sequence comparison isn't the bottleneck.

Encoding DNA strand in Binary

Answers (2)

Related Questions