SimonSalman

Reputation: 351

cluster short, homogeneous strings (DNA) according to common sub-patterns and extract consensus of classes

Task:
to cluster a large pool of short DNA fragments into classes that share common sub-sequence patterns, and to find the consensus sequence of each class.

Plan:
to perform a multiple alignment (e.g. with ClustalW2) to find classes that share common sequences in region 2, along with their consensus sequences.

Questions:

  1. Are my fragments too short, and would it help to increase their size?
  2. Is region 2 too homogeneous, with only two allowed letter types, for showing patterns in its sequence?
  3. Which alternative methods or tools can you suggest for this task?

Best regards,

Simon

Upvotes: 6

Views: 588

Answers (2)

Ron Gejman

Reputation: 6215

Yes, 300 is FAR TOO FEW considering that this is the human genome and you're essentially just looking for a particular kind of 8-mer. There are 4^8 = 65,536 possible 8-mers and roughly 3,000,000,000 bases in the genome (assuming you're looking at the entire genome and not just genic or coding regions). Of those 8-mers, 2^8 = 256 consist only of G and C, so if all bases were equally likely you would find G/C-only 8-mers about 3,000,000,000 / 65,536 * 2^8 =~ 12,000,000 times (and probably much more, since the genome is full of CpG islands). Why only choose 300?
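
The estimate above can be reproduced in a few lines of Python. This is only a back-of-the-envelope sketch: it assumes all four bases are equally likely and independent, which is not true of a real genome, and the variable names are made up for illustration.

```python
# Rough expected number of G/C-only 8-mers in a genome-sized sequence,
# ASSUMING uniform, independent base composition (a simplification).
GENOME_SIZE = 3_000_000_000   # approximate number of bases, as quoted above
K = 8                         # k-mer length

total_kmers = 4 ** K          # all possible 8-mers over A, C, G, T
gc_only_kmers = 2 ** K        # 8-mers built from G and C only

# Expected hits per specific 8-mer, times the number of G/C-only 8-mers.
expected_gc_kmer_hits = GENOME_SIZE / total_kmers * gc_only_kmers

print(f"{total_kmers:,} possible 8-mers, {gc_only_kmers} of them G/C-only")
print(f"about {expected_gc_kmer_hits:,.0f} expected G/C-only 8-mer positions")
```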

You don't want to use regexes for this task. Just start at chromosome 1, look for the first CG or GC and extend until you hit the first base that is not G or C. Then take that sequence and its context and save them (in a DB). Rinse and repeat.
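
A minimal sketch of that scan in Python, under a few assumptions of mine: the chromosome is already loaded as a string, any maximal run of G/C bases counts (so GG and CC starts are included, not only CG/GC), and the function name `find_gc_runs` and the `min_len` and `flank` parameters are hypothetical. Results are returned as tuples rather than written to a database.

```python
def find_gc_runs(seq, min_len=8, flank=5):
    """Return (start, end, left_context, run, right_context) for every
    maximal run of G/C bases that is at least min_len long."""
    seq = seq.upper()
    n = len(seq)
    runs = []
    i = 0
    while i < n:
        if seq[i] in "GC":
            start = i
            # Extend until the first base that is not G or C.
            while i < n and seq[i] in "GC":
                i += 1
            if i - start >= min_len:
                left = seq[max(0, start - flank):start]
                right = seq[i:i + flank]
                runs.append((start, i, left, seq[start:i], right))
        else:
            i += 1
    return runs


example_runs = find_gc_runs("ATGCGCGGCCTTACGTA", min_len=8, flank=2)
print(example_runs)
```

For a whole genome you would stream each chromosome and store the tuples in whatever database you prefer; the single pass over the sequence stays the same.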

For this project, Clustal may be overkill -- but I don't know your objectives so I can't be sure. If you're only interested in the GC region, then you can do some simple clustering like so:

  1. Make a database entry for each G/C 8-mer (2^8 = 256 in all).
  2. Take each GC-region and walk it to see which 8-mers it contains.
  3. Tag each GC-region with the sequences it contains.
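
The three steps above could look roughly like this in Python. It is a sketch, not a finished pipeline: an in-memory dictionary stands in for the database, and the function name `index_by_gc_kmers` and the region IDs are invented for the example.

```python
from itertools import product


def index_by_gc_kmers(regions, k=8):
    """Map every G/C-only k-mer to the set of region IDs that contain it.

    regions: dict of region_id -> sequence string.
    """
    # Step 1: one entry per G/C k-mer (2**k entries, 256 for k=8).
    index = {"".join(p): set() for p in product("GC", repeat=k)}

    for region_id, seq in regions.items():
        seq = seq.upper()
        # Step 2: walk the region with a sliding window of width k.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Step 3: tag the region under each G/C k-mer it contains.
            if kmer in index:
                index[kmer].add(region_id)
    return index


example_regions = {"r1": "GCGCGGCCG", "r2": "CGCGGCCG"}
example_index = index_by_gc_kmers(example_regions)
print(sorted(example_index["CGCGGCCG"]))
```

Each set in the index is then one candidate class: all regions sharing that 8-mer.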

Now, for each 8-mer, you have thousands of sequences which contain it. I'll leave the analysis of the data up to your own objectives.

Upvotes: 2

Calyth

Reputation: 1673

Your region 2, with only two letters, may end up a bit too similar across fragments. Increasing its length or its variability (e.g. allowing more letters) could help.

Upvotes: 1
