The amino acid codes and their properties

Fig 2.1. A table of the amino acid codes and their properties,
accompanied by a widely used colour scheme.

Similarity Scoring Matrices

In the previous Chapter, you dealt with an exact matching method, where each amino acid self-match is considered equally. However, as the table above illustrates, amino acids can be grouped together in different classes, based on their physico-chemical properties. Such amino acid classes can be said to contain "similar" amino acids. Analysis of mutation statistics in protein families allows one to quantify the likelihood of interchangeability between different amino acids in a protein. As expected, amino acids that are physicochemically similar are often more interchangeable than ones that are not. Different scoring matrices have been devised to describe, quantitatively, the "similarity" of each amino acid with respect to all the others. Examples of such matrices include the PAM and BLOSUM series. Such matrices may be used to help alignment/searching algorithms score the overall similarity between two (or more) stretches of sequence.

Fig 2.2 below depicts the log odds form of the mutation data matrix for PAM250s.

Fig 2.2. Log odds form of the mutation data matrix for PAM250s.


Heuristic algorithms

Due to the large amount of information stored in databases, searching for the optimal alignment between a query sequence and target sequences in a database can take a lot of time. As a result, heuristic algorithms have been developed. Heuristic algorithms find a result quite close to the best match in a much shorter period of time, by "ignoring" or "suppressing" some, normally unlikely, possibilities.

For example, rather than doing a complete sequence alignment, BLAST2 narrows down its search for similar regions in two sequences by focusing on high-scoring "words" (see Fig 2.3), and then extending out from these. It uses a more stringent threshold to find the high-scoring words, and then extends out with a less stringent one.

It being heuristic does not mean that it won't find the best match, but can only be verified with another more powerful non-heuristic method. The two most widely used algorithms are FASTA and BLAST.

Fig 2.3. Schematic representation of how BLAST works.

Dynamic algorithms

This type of algorithm solves a problem by combining solutions to sub-problems that are computed once and saved in a matrix. An optimal final solution is then built by combining the best sub-solutions. Matches in the alignments are scored in an additive fashion (i.e., the score from the previous residue is added to the score of the current match), according to the substitution matrix used.

The two most common dynamic programming approaches are local (Smith-Waterman) and global alignment (Needleman-Wunch). These differ in the way in which gaps are scored: with the local Smith-Waterman, all gaps are scored as zero; in the global Needleman-Wunch, gaps are penalised cumulatively.

Dynamic algorithms tend to be slower but more accurate than heuristic ones (especially slower in multiple alignments). Illustrations of such algorithms can be found here and here (the latter focuses on the difference between local and global alignment). NOTE: these require a current version of a Java-enabled browser.

Similarity searching with BLAST

By contrast with keyword searching and identical sequence matching (as performed in the previous Chapter), it is also possible to search for sequence similarities. Similarity search algorithms are usually classified as global or local. Global algorithms optimise the full alignment of 2 sequences, which may include large dissimilar regions. Local similarity algorithms focus on conserved subsequences, a single comparison often yielding several alignments (dissimilar regions do not contribute to the similarity measure). Local similarity tools, such as BLAST, are usually preferred for database searches, where for example cDNAs may be compared with partially sequenced genes, or where distantly-related proteins may share only isolated regions of similarity (e.g., in the vicinity of an active site). Interpretation of results is straightforward, requiring one to identify high-scoring sequences, or groups of sequences, with low probabilities that such matches may have arisen by chance. This probability function is called the E-value (expect value) and its statistical importance is illustrated in Fig 2.4. For a more thorough explanation of the statistics used by BLAST, please refer to the NCBI BLAST tutorial pages.

Fig 2.4. Statistical importance of the E-value in similarity searching.


A recent, hybrid approach, combining elements of both pairwise and multiple sequence alignment methods is Position-Specific Iterated (PSI)-BLAST. Although motif-based methods are recognised as being highly sensitive and selective, their main drawback is that the derivation of diagnostic family signatures is time-consuming. The innovation of PSI-BLAST is that, following an initial database search, it allows automatic creation of position-specific profiles from groups of results that match a query above a defined threshold, as illustrated in Fig 2.5. Running the program several times can further refine the profile and increase search sensitivity.


Fig 2.5. Outline of the Position Specific Iterated BLAST algorithm.

As with other iterative methods, however, PSI-BLAST is not infallible and can have disadvantages. For example, unless low-complexity segments are masked, automated iterative searches tend to degenerate to compositionally biased sequences (such as collagen, etc.), leading to profile dilution. Once a false sequence has crept into a profile, the search will thereafter be biased to accept many more unrelated sequences. It is therefore vital to validate relationships inferred from unsupervised iterative profile searches to eliminate erroneous matches.