Aman Dalmia

Reputation: 406

Adding noise to genomic data having discrete values (A, G, T, C)

Since genomic sequences vary greatly in length, I have been trying to use denoising autoencoders to get a compact representation of any given sequence. My expected input is a sequence of nucleotides (the letters A, G, T, C), for example, "AAAAGGAATTTCTCTGGGG....".

For images, adding noise is easy since the input lives in a continuous space. But in a discrete scenario such as this, what would be a good strategy for adding noise to my input?

My first thought is to randomly replace some of the nucleotides with "N", the symbol for a position whose nucleotide couldn't be identified accurately during sequencing. But changing even one nucleotide yields a different sequence, unlike images, where adding a small amount of noise doesn't change how the image looks. Please let me know if this approach is reasonable or if there's a better way that I'm not aware of.
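To make the masking idea concrete, here is a minimal sketch of what I have in mind. The function name `mask_nucleotides`, the `rate` parameter and the default of 10% are placeholders I made up for illustration, not something taken from a library.

```python
import random

def mask_nucleotides(seq, rate=0.1, mask="N", rng=None):
    """Replace each base with `mask` independently with probability `rate`.

    `rate` and `mask` are illustrative choices (assumptions), not tuned values.
    """
    rng = rng or random.Random()
    # Each position is masked independently; the sequence length is preserved.
    return "".join(mask if rng.random() < rate else base for base in seq)

# Seeded generator so the corruption is reproducible between runs.
noisy = mask_nucleotides("AAAAGGAATTTCTCTGGGG", rate=0.2, rng=random.Random(0))
```

The clean sequence would be the reconstruction target and `noisy` the autoencoder input.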

Upvotes: 0

Views: 185

Answers (1)

BioGeek

Reputation: 22847

I'm not sure if this will help you or further complicate your issue, but in biology people normally use FASTQ files to store biological sequences and their corresponding Phred quality scores. A Phred quality score is a measure of the quality of the identification of the nucleobases generated by automated DNA sequencing.

For example, if Phred assigns a quality score of 30 to a base, the chances that this base is called incorrectly are 1 in 1000.
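The relationship behind that example is Q = -10 * log10(P), where P is the probability that the base call is wrong. A small sketch of the conversion in both directions (the function names are my own, chosen for illustration):

```python
import math

def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, using P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score Q for an error probability P, using Q = -10 * log10(P)."""
    return -10 * math.log10(p)

p30 = phred_to_error_prob(30)  # 0.001 up to floating-point rounding, i.e. 1 in 1000
```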

[Figure: Phred quality scores shown on a DNA sequence trace. Public domain image from Wikipedia.]

So you can add noise to the Phred quality scores (which encode the probability that each base call is wrong) without changing the sequence itself.
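As a rough sketch of that idea, the snippet below perturbs a FASTQ quality string with Gaussian noise. It assumes the common Phred+33 encoding (character = `chr(Q + 33)`); the function name `jitter_quality`, the noise level `sigma` and the clipping range 0 to 41 are my own illustrative assumptions, so adjust them to your data.

```python
import random

def jitter_quality(qual, sigma=2.0, q_min=0, q_max=41, offset=33, rng=None):
    """Add Gaussian noise to each Phred score in a Phred+33 quality string.

    `sigma`, `q_min` and `q_max` are assumed values for illustration only.
    """
    rng = rng or random.Random()
    out = []
    for ch in qual:
        q = ord(ch) - offset                       # decode character to Phred score
        q_noisy = round(q + rng.gauss(0, sigma))   # perturb and round to an integer
        q_noisy = max(q_min, min(q_max, q_noisy))  # clip to the allowed score range
        out.append(chr(q_noisy + offset))          # re-encode as a character
    return "".join(out)

# The nucleotide string itself is left untouched; only the qualities change.
noisy_qual = jitter_quality("IIIIHHGGFF##", sigma=2.0, rng=random.Random(0))
```

The autoencoder would then need to take the quality scores as an extra input channel next to the bases, since the bases alone are not altered by this kind of noise.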

Also see this paragraph about current work done on compressing FASTQ files.

Upvotes: 1
