I'm trying to understand the basic steps of FASTA algorithm in searching similar sequences of a query sequence in a database. These are the steps of the algorithm: Identify common k-words between I and J Score diagonals with k-word matches, identify 10 best diagonals Rescore initial regions with a substitution score matrix Join initial regions using gaps, penalise for gaps Perform dynamic programming to find final alignments I'm confused with the 3rd and 4th steps in using PAM250 score matrix, and how to "join using gaps". Can somebody explain these two steps for me "as specifically as possible". Thanks

Reputation: 2193

FASTA Algorithm Explanation

I'm trying to understand the basic steps of FASTA algorithm in searching similar sequences of a query sequence in a database. These are the steps of the algorithm:

Identify common k-words between I and J
Score diagonals with k-word matches, identify 10 best diagonals
Rescore initial regions with a substitution score matrix
Join initial regions using gaps, penalise for gaps
Perform dynamic programming to find final alignments

I'm confused with the 3rd and 4th steps in using PAM250 score matrix, and how to "join using gaps".

Can somebody explain these two steps for me "as specifically as possible". Thanks

Upvotes: 7

Answers (2)

Bill Pearson

Reputation: 41

The explanation is essentially correct, but the final band optimization is centered on the one best ungapped alignment found in step 2. Step 3 is used simply to improve the sensitivity in the choice of sequences that get step 4.

The original paper can be seen here: http://faculty.virginia.edu/wrpearson/papers/pearson_lipman_pnas88.pdf

Upvotes: 3

reve_etrange

Reputation: 2571

This is how FASTA works:

Find all k-length identities, then find locally similar regions by selecting those dense with k-word identities (i.e. many k-words, without too many gaps between). The best ten initial regions are used.
The initial regions are re-scored along their lengths by applying a substitution matrix in the usual way. Optimally scoring subregions are identified.
Create an alignment of the trimmed initial regions using dynamic programming, with a gap penalty of 20. Regions with too low of a score are not included.
Optimize the alignment from 3) using "banded" dynamic programming (Smith-Waterman). This is dynamic programming restricted to the 32 residue-wide band around the original alignment, which saves space and time over full dynamic programming.

If there are insufficient initial regions to form an alignment in 3), the best score from 2) can be used to rank sequences by similarity. Scores from 3) and 4) can also be used for that purpose.

Unfortunately my institution doesn't have access to the original FASTA paper so I can't supply the original values of the various parameters mentioned above.

Upvotes: 9

FASTA Algorithm Explanation

Answers (2)

Related Questions