c++algorithmcompressiondna-sequencelossless-compression

Reputation: 105

How can we compress DNA string efficiently

DNA strings can be of any length comprising any combination of the 5 alphabets (A, T, G, C, N).
What could be the efficient way of compressing DNA string of alphabet comprising 5 alphabets (A, T, G, C, N). Instead of considering 3 bits per alphabet, can we compress and retrieve efficiently using less number of bits? Can anybody suggest a pseudo code for efficient compression and retrieval?

Upvotes: 5

Answers (5)

dim

Reputation: 11

There are plenty of methods to compress, but the main question is what data you want to compress? 1. Raw unaligned sequenced data from the sequencing machine (fastq) 2. Aligned data (sam/bam/cram) 3. Reference genomes

You should reorder your reads putting reads from the close genome positions close to each other. For instance, this would allow usual gzip compress 3 times better. There are many ways to do this. You can for instance align fastq to bam and than export back to fastq. Use Suffix Tree/Array to find similar reads, the way most aligners work (needs a lot of memory). Use minimizers - very fast, low memory solution, but not good for long reads with many errors. Good results are from debruijn graph construction, which is used for the purpose (aka de-novo alignment). Statistical coding like huffman / arithmetic would compress to 1/3 (one can pass huffman stream to binary arithmetic coder to gain another 20%).
The best results are from reference-based compressions here - just store differences between a reference and the aligned read.
Little can be done here. Using statistical coding you can get 2-3 bits per nucleotide.

Upvotes: 1

גלעד ברקן

Reputation: 23955

We can use a combination of Roee Gavirel's idea and the following for an even tighter result. Since Roee's idea still stipulates that two out of our five characters be mapped to a 3-bit word, sections of the sequence where at least one of the five characters does not appear but one of the 3-bit words does could be mapped to 2-bit words, reducing our final result.

The condition to switch mapping is if there exists a section where at least one of the five characters does not appear and one of the 3-bit words appears just one time more than two times our section-prefix length in bits. If we order our possible characters (for example, alphabetically), given three bits indicating a specific missing character (if there's more than one missing, we choose the first in order) or none missing, we can immediately assign a consistent 2-bit mapping for the other four characters.

Two ideas for our prefixes:

(1)

3 bits: the missing character (if none, we use Roee's encoding for the section);
x bits: a constant number of bits representing section length. For maximal length sections of 65000, we could assign x = 16.

To justify the prefix use, we'd need a section where one of the five characters does not appear and one of the 3-bit words appears 39 times or more.

(2)

3 bits: the missing character (if none, we use Roee's encoding for the section);
x bits: the number of bits in the next section of the prefix - depends on how many characters the longest section could be. x = 6 would imply the maximal section length could be 2^(2^6)! Unlikely. For maximal length sections of 65000, we could assign x = 4;
the number of bits indicated in the previous part of the prefix, indicating the current section length.

In the example just above, our prefix length could vary between say 11 and 23 bits, which means to justify its use, we'd need a section where one of the five characters does not appear and one of the 3-bit words appears between 23 to 47 times or more.

Upvotes: 0

comingstorm

Reputation: 26117

Frankly, I would start with some version of Lempel-Ziv compression (a class of compression algorithms that includes the general-purpose gzip compression format). I note that some of the comments say that general-purpose compression algorithms don't work well on raw genome data, but their effectiveness depends on how the data is presented to them.

Note that most general-purpose compression programs (like gzip) examine their input on a per-byte basis. This means that "pre-compressing" the genome data at 3 bits/base is counterproductive; instead, you should format the uncompressed genome data at one byte per base before running it through a general-purpose compressor. Ascii "AGTCN" coding should be fine, as long as you don't add noise by including spaces, newlines, or variations in capitalization.

Lempel-Ziv compression methods work by recognizing repetitive substrings in their input, then encoding them by reference to the preceding data; I'd expect this class of methods should do a reasonably good job on appropriately presented genome data. A more genome-specific compression method may improve on this, but unless there's some strong, non-local constraint on genome encoding that I'm unaware of, I would not expect a major improvement.

Upvotes: 0

rustyx

Reputation: 85541

You have 5 unique values, so you need a base-5 encoding (e.g. A=0, T=1, G=2, C=3, N=4).

In 32 bits you can fit log₅(2³²) = 13 base-5 values.

In 64 bits you can fit log₅(2⁶⁴) = 27 base-5 values.

The encoding process would be:

uint8_t *input = /* base-5 encoded DNA letters */;
uint64_t packed = 0;
for (int i = 0; i < 27; ++i) {
    packed = packed * 5 + *input++;
}

And decoding:

uint8_t *output = /* allocate buffer */;
uint64_t packed = /* next encoded chunk */;
for (int i = 0; i < 27; ++i) {
    *output++ = packed % 5;
    packed /= 5;
}

Upvotes: 3

Roee Gavirel

Reputation: 19473

you can if you willing to (a) have different bits size for each char and (b) you are always reading from the start and never from the middle. then, you can have a code something like:

A - 00
T - 01
G - 10
C - 110
N - 111

Reading from left to right you can only split a stream of bits to chars in one way. You read 2 bits at a time and if they are "11" you need to read one more bit to know what char it is.

This is based on Huffman Coding Algorithm

Note:
I don't know much about DNA, but if the probability of the chars is not equal (meaning 20% each). you should assign the shortest codes to those with the higher probability.

Upvotes: 7

How can we compress DNA string efficiently

Answers (5)

Related Questions