
Fig
2.1. A table of the amino acid codes and their properties,
accompanied by a widely used colour scheme. |
Similarity
Scoring Matrices
In the previous Chapter, you dealt
with an exact matching method, where each amino acid self-match is considered
equally. However, as the table above illustrates, amino acids can be grouped
together in different classes, based on their physico-chemical properties. Such
amino acid classes can be said to contain "similar" amino acids. Analysis
of mutation statistics in protein families allows one to quantify the likelihood
of interchangeability between different amino acids in a protein. As expected,
amino acids that are physicochemically similar are often more interchangeable
than ones that are not. Different scoring matrices have been devised to describe,
quantitatively, the "similarity" of each amino acid with respect to
all the others. Examples of such matrices include the PAM and BLOSUM series.
Such matrices may be used to help alignment/searching algorithms score the overall
similarity between two (or more) stretches of sequence.
Fig 2.2 below depicts
the log odds form of the mutation data matrix for PAM250s.

Fig
2.2. Log odds form of the mutation data matrix for PAM250s. |
Heuristic algorithms
Due
to the large amount of information stored in databases, searching for the optimal
alignment between a query sequence and target sequences in a database can take
a lot of time. As a result, heuristic algorithms have been developed. Heuristic
algorithms find a result quite close to the best match in a much shorter period
of time, by "ignoring" or "suppressing" some, normally unlikely,
possibilities.
For
example, rather than doing a complete sequence alignment, BLAST2 narrows down
its search for similar regions in two sequences by focusing on high-scoring
"words" (see Fig 2.3), and then extending out from these. It uses
a more stringent threshold to find the high-scoring words, and then extends
out with a less stringent one.
It
being heuristic does not mean that it won't find the best match, but
can only be verified with another
more powerful non-heuristic method.
The two most widely used algorithms are FASTA
and BLAST.

Fig
2.3. Schematic representation of how BLAST works. |
Dynamic
algorithms
This type of algorithm
solves a problem by combining solutions to sub-problems that are computed once
and saved in a matrix. An optimal final solution is then built by combining
the best sub-solutions. Matches in the alignments are scored in an additive
fashion (i.e., the score from the previous residue is added to the
score of the current match), according to the substitution matrix used.
The two most common dynamic
programming approaches are local (Smith-Waterman) and global alignment (Needleman-Wunch).
These differ in the way in which gaps are scored: with the local Smith-Waterman,
all gaps are scored as zero; in the global Needleman-Wunch, gaps are penalised
cumulatively.
Dynamic algorithms tend
to be slower but more accurate than heuristic ones (especially slower in multiple
alignments). Illustrations of such algorithms can be found here
and here
(the latter focuses on the difference between local and global alignment). NOTE:
these require a current version of a Java-enabled browser.
By contrast with keyword searching
and identical sequence matching (as performed in the previous Chapter), it is
also possible to search for sequence similarities. Similarity search algorithms
are usually classified as global or local. Global algorithms optimise the full
alignment of 2 sequences, which may include large dissimilar regions. Local similarity
algorithms focus on conserved subsequences, a single comparison often yielding
several alignments (dissimilar regions do not contribute to the similarity measure).
Local similarity tools, such as BLAST, are usually preferred for database searches,
where for example cDNAs may be compared with partially sequenced genes, or where
distantly-related proteins may share only isolated regions of similarity (e.g.,
in the vicinity of an active site). Interpretation of results is straightforward,
requiring one to identify high-scoring sequences, or groups of sequences, with
low probabilities that such matches may have arisen by chance. This probability
function is called the E-value (expect value) and its statistical importance is
illustrated in Fig 2.4. For a more thorough explanation of the statistics used
by BLAST, please refer to the NCBI
BLAST tutorial pages.

Fig
2.4. Statistical importance of the E-value in similarity searching. |
A recent, hybrid approach, combining
elements of both pairwise and multiple sequence alignment methods is Position-Specific
Iterated (PSI)-BLAST. Although motif-based methods are recognised as being highly
sensitive and selective, their main drawback is that the derivation of diagnostic
family signatures is time-consuming. The innovation of PSI-BLAST is that, following
an initial database search, it allows automatic creation of position-specific
profiles from groups of results that match a query above a defined threshold,
as illustrated in Fig 2.5. Running the program several times can further refine
the profile and increase search sensitivity.

Fig
2.5. Outline of the Position Specific Iterated BLAST algorithm. |
As with other iterative
methods, however, PSI-BLAST is not infallible and can have disadvantages. For
example, unless low-complexity segments are masked, automated iterative searches
tend to degenerate to compositionally biased sequences (such as collagen, etc.),
leading to profile dilution. Once a false sequence has crept into a profile,
the search will thereafter be biased to accept many more unrelated sequences.
It is therefore vital to validate relationships inferred from unsupervised iterative
profile searches to eliminate erroneous matches.
|