Efficient Data Storage For Nucleotides With Common Repeats

Question

I'm working on a fun problem: finding a more efficient way to store the genome of the human malaria parasite. I thought it would be useful to get some of your insights!

So here's the background info: suppose we're using only 2 bits per nucleotide to store the 4 nucleotides of the genome (A, C, T, G). Because the genome is still SUPER long, it takes up a ton of space. However, we know that 80% of the genome is either A or T - how can we use this knowledge to our advantage to store the genome in a more efficient way?
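As a rough yardstick for how much there is to gain, here is a small sketch that computes the Shannon entropy of the base frequencies. It assumes A and T each make up 40% of the genome and C and G each make up 10%; only the 80% total is known, so the even split is an assumption for illustration. Under that assumption, a code that treats every base independently cannot average below about 1.72 bits per base, roughly 14% less than 2 bits. Schemes that exploit long runs or other context could in principle do better.

```python
import math

# Assumed base frequencies: only "A + T = 80%" is given in the post.
# The even split within each pair is an assumption for illustration.
freqs = {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1}

def entropy_bits(probabilities):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

bits_per_base = entropy_bits(freqs.values())
saving = 1 - bits_per_base / 2  # relative to the fixed 2-bit encoding

print(f"{bits_per_base:.3f} bits per base, {saving:.1%} smaller than 2 bits")
```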

Right now I'm playing around with a couple ideas:

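Find some way to encode long runs of A's or long runs of T's (run-length encoding). Each run would need more than 2 bits, but if the runs are long enough, it could reduce the overall size. For example, if '01' is the code for 'T', then '1101' could be the code for 3 T's: the first two bits, '11', are the run length 3 in ordinary binary, and the last two, '01', are the nucleotide. That takes 4 bits instead of the 6 needed for 'TTT', saving 2 bits, although the decoder would also need some way to tell a run code apart from a plain 2-bit code.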
Simply store A as '0' and T as '1' to reduce the number of bits these letters use.
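To make the second idea concrete, here is a minimal sketch of a variable-length prefix code. Note that it is a variation rather than the exact scheme above: if A and T both get 1-bit codes, there is no decodable code left for C and G, so T gets '10' here. The code table and the sample sequence are hypothetical, and the bits are kept as a text string of '0' and '1' characters for readability; real storage would pack them into bytes. If A and T were each 40% of the genome and C and G each 10% (an assumption), this table would average 0.4*1 + 0.4*2 + 0.1*3 + 0.1*3 = 1.8 bits per base.

```python
# Hypothetical prefix code: no codeword is the start of another,
# so the bit stream can be decoded without separators.
CODE = {"A": "0", "T": "10", "C": "110", "G": "111"}
DECODE = {bits: base for base, bits in CODE.items()}

def encode(seq):
    """Concatenate the codeword of every base in the sequence."""
    return "".join(CODE[base] for base in seq)

def decode(bits):
    """Read bits left to right, emitting a base whenever a codeword completes."""
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in DECODE:
            out.append(DECODE[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

seq = "ATTTAACGTATA"  # made-up sample, not real genome data
bits = encode(seq)
assert decode(bits) == seq  # round trip recovers the original sequence
print(len(bits), "bits vs", 2 * len(seq), "bits at 2 bits per base")
```

The saving depends entirely on the real base frequencies, so it would be worth measuring on the actual genome before committing to a table like this.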

Anyone else have any good ideas for making this data storage as efficient as possible? I'd love to hear 'em and discuss!
