CONTENTS

THE PRINTS DATABASE

1.0 Introduction
1.1 What do we mean by a motif?
1.2 What do we mean by a fingerprint?
1.3 How are fingerprints generated?
1.4 What is the format for database entries?
    a) General field
       i) Composite fingerprints
       ii) Simple fingerprints
    b) Summary field
       i) Composite fingerprints
       ii) Simple fingerprints
    c) Discrimination indices
       i) Composite fingerprints
       ii) Simple fingerprints
    d) True and false matches, and subfamilies
       i) Composite fingerprints
       ii) Simple fingerprints
    e) Protein titles
       i) Composite fingerprints
       ii) Simple fingerprints
    f) Scan history
       i) Composite fingerprints
       ii) Simple fingerprints
    g) Initial motif-sets
       i) Composite fingerprints
       ii) Simple fingerprints
    h) Final motif-sets
       i) Composite fingerprints
       ii) Simple fingerprints
1.5 Associated software
1.6 The alignment database
2.0 References
2.1 Applications


THE PRINTS DATABASE

1.0 Introduction

The PRINTS database is a compendium of protein motif fingerprints. Each fingerprint has been defined and iteratively refined using database scanning procedures within the ADSP or VISTAS sequence analysis packages.

Two types of fingerprint are represented in the database, i.e. they are either simple or composite, depending on their complexity: simple fingerprints are essentially single-motifs; while composite fingerprints encode multiple motifs. The bulk of the database entries are of the latter type because discrimination power is greater for multi-component searches, and results are consequently easier to interpret.

There are two main reasons for compiling a database of this sort: (i) rationalisation of the vast amount of data contained within the OWL composite sequence database (i.e. resolving sequences into families, superfamilies and subfamilies); and (ii), as a consequence of this, improving the efficiency of sequence analysis in the future.
There are two immediate ways in which sequence analysis is made more efficient: first, new sequences can be run against the entire database to get possible clues about structure or function; and second, database entries can be extracted to run against individual sequences or personal databases. Both options are essentially knowledge-based and thus allow extremely rapid diagnosis by comparison with time-consuming searches of the entire composite database.

1.1 What do we mean by a motif?

A motif is any conserved element of a sequence alignment: it is a local alignment that may correspond to a region whose function or structure is known, or whose significance is unknown. It is sufficient that it is conserved, and is hence likely to be predictive of any subsequent occurrence of such a structural/functional region in any other protein sequence.

1.2 What do we mean by a fingerprint?

A fingerprint is a set of motifs used to predict the occurrence of similar motifs, either in an individual sequence or in a database. Fingerprints are refined, i.e. their diagnostic potency is enhanced, by iterative scanning of the OWL composite sequence database. Database searches with such aligned motifs are essentially frequency scans: i.e., in their most basic application, no secondary structure information, similarity data, or weighting scheme of any description is used to improve discrimination power.

A composite or multiple-motif fingerprint contains a number of aligned motifs taken from different parts of a multiple alignment. Discrimination power is increased in these systems because the recognition of individual elements of the fingerprint is mutually conditional. True family members are then easy to identify by virtue of possessing all elements of the fingerprint, while subfamily members may be identified by possessing only part of it.

1.3 How are fingerprints generated?
The fingerprinting method relies on the fact that, in any protein family, only parts of a sequence are held in common - these usually relate to key functional regions or to core structural elements of the fold. The starting point for fingerprint definition is thus multiple sequence alignment, which we achieve using the SOMAP, XALIGN or VISTAS manual alignment programs. Only a small number of sequences needs to be included in the initial alignment because the method itself is designed to add to the alignment with each database scan.

Once a motif, or set of motifs, has been identified, the conserved regions are excised in the form of local alignments. There are no rules regarding the juxtaposition of such motifs, other than that they should not substantially overlap: thus motifs may occur immediately adjacent to one another, or they may be separated by any distance along the length of the alignment.

Independent database scans are made with each aligned motif, resulting in the production of a set of hitlists, one for each motif. The hitlists are then analysed, or correlated, to determine which sequences in the database have matched with all elements of the fingerprint, and which have only matched with part of it. Only those sequences that match with all elements are regarded as true matches. If the search has worked well, the true set will contain more sequences than did the original alignment.

The additional sequence data from the new true set is then used to generate another set of aligned motifs, and the database is searched again - where the family being diagnosed is very large, redundant motifs are removed from the alignments prior to the next scan. This process is repeated until convergence, the point at which the true set remains constant between successive scans. The final aligned motifs from this iterative procedure constitute the refined fingerprint that is entered into the PRINTS database.
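
The correlation step and the convergence test described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the ADSP or VISTAS implementation: the names correlate, true_set and refine are hypothetical, and the scan argument is an assumed callback that stands in for "rebuild the aligned motifs from the current true set and scan the database once with each motif".

```python
from collections import Counter


def correlate(hitlists):
    """Count, for every sequence code, how many motif hitlists contain it."""
    counts = Counter()
    for hits in hitlists:
        counts.update(set(hits))  # a code counts at most once per hitlist
    return counts


def true_set(hitlists):
    """Codes that match ALL elements of the fingerprint (the true matches)."""
    n_motifs = len(hitlists)
    return {code for code, k in correlate(hitlists).items() if k == n_motifs}


def refine(initial_codes, scan, max_iterations=20):
    """Iterate until the true set is unchanged between successive scans.

    `scan` is a hypothetical callback: given the current set of codes it
    returns one hitlist per motif. Returns (true set, iterations performed).
    """
    current = set(initial_codes)
    for iteration in range(1, max_iterations + 1):
        new = true_set(scan(current))
        if new == current:  # convergence: true set constant between scans
            return current, iteration
        current = new
    return current, max_iterations
```

Sequences that appear in only some of the hitlists are the partial matches; the counts returned by correlate are the raw material for the summary tables described in section 1.4.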
Good fingerprints find all true matches that exist in OWL, they exhibit clear discrimination cut-offs, and include little or no noise. Occasionally, cut-offs may be difficult to assess: this is usually the result of subfamily identification, where a sequence subset has been recognised that is characterised by only part of a fingerprint; but it may also result from the use of only 2 or 3 motifs in the fingerprint - discrimination power improves with the number of motifs used.

1.4 What is the format for database entries?

The information held in a database entry is contained within a number of discrete fields, each designed to convey specific types of information to the user (and each of which can be accessed by its query language). For example, a brief overview of the type of fingerprint is given in the initial "general" field; a summary of the performance of the fingerprint, and of its individual elements, is given in appropriate "summary" fields; actual true- and/or false-positive matches are provided in subsequent fields, as are titles, and initial and final motifs themselves. For the purpose of illustration, we will deal with these in turn, making the distinction in each case between multiple-motif and single-motif fingerprints.

a) General field

i) Composite fingerprints

The "general" field comprises a number of parts, detailing the entry name; the type of fingerprint; the equivalent PROSITE accession number and code (if one exists); the author and date of creation; references; and finally, documentation regarding the nature of the protein family under investigation and details of how the fingerprint was derived, etc. An example taken from the database is as follows:-

DAGPE
DIACYLGLYCEROL/PHORBOL-ESTER BINDING SIGNATURE
Type of feature: COMPOSITE with 4 elements
Prosite code: PS00479 DAG_PE_BINDING_DOMAIN; PATTERN
Created by T.K.ATTWOOD, 5-APR-1991 (UPDATE M.E.BECK, 5-APR-1993)

1. KIKKAWA, U., KISHIMOTO, A. and NISHIZUKA, Y.
The protein kinase C family: heterogeneity and its implications.
ANNU.REV.BIOCHEM. 58 31-44 (1989).

2. ONO, Y., FUJII, T., IGARASHI, K., TANAKA, C., KIKKAWA, U. and NISHIZUKA, Y.
Phorbol ester binding to protein kinase C requires a cysteine zinc-finger-like sequence.
PROC.NATL.ACAD.SCI.USA 86 4868-4871 (1989).

3. BOGUSKI, M.S., BAIROCH, A., ATTWOOD, T.K. and MICHAELS, G.S.
Proto-vav and gene expression.
NATURE 358 113 (1992).

Diacylglycerol (DAG) is an important second messenger; phorbol esters (PE) are analogues of DAG and are potent tumour promoters that cause a variety of physiological changes when administered to both cells and tissues. DAG activates a family of Ser/Thr protein kinases known collectively as protein kinase C (PKC) [1]: PE's can stimulate PKC directly. The N-terminal region of PKC, known as C1, has been shown to bind PE and DAG in a phospholipid-dependent fashion [2]. The C1 region of PKC contains 1 or 2 copies of a Cys-rich region of about 50 amino acid residues, which is essential for DAG/PE binding.

DAGPE is a 4-element fingerprint that encodes this Cys-rich region. The fingerprint was derived from an initial alignment of 13 sequences (A.Bairoch, personal communication) and was used to scan OWL10.0. Convergence was reached after 2 iterations, at which point every known occurrence of the motif was detected (38 in all), together with an additional protein - the human vav oncogene. The implication is that the vav oncogene may well bind DAG and/or PE's and could thus be involved in signal transduction [3].

An update on OWL19.1 identified a true set comprising 61 sequences, together with 4 partial matches - all of these match motif 3 well, but correspond less well with either motifs 1, 2 or 4.

In this example, DAGPE is the general code given to the fingerprint that describes the diacylglycerol/phorbol-ester binding domain. The fingerprint comprises 4 motifs, and further information can be obtained with reference to the PROSITE code DAG_PE_BINDING_DOMAIN (accession number PS00479). The fingerprint was derived on 5 April 1991 and the last update was 5 April 1993. Three references are provided, after which is given some information about diacylglycerol and phorbol esters, together with a description of the derivation of the fingerprint and its latest update.

ii) Simple fingerprints

The "general" field for single-motif fingerprints exactly parallels that for multiple-motif fingerprints, except that the nomenclature differs slightly: e.g.

HELIX1N
TYPE I ALPHA-HELIX N-TERMINAL SIGNATURE
Type of feature: SIMPLE
Created by D.J.PARRY-SMITH, 1-JUN-1989

Here we see the initial part of the "general" field for an entry with the general code HELIX1N, which describes type-I N-terminal alpha-helix substructures. This entry is a simple fingerprint, i.e. contains a single motif. There is no equivalent PROSITE code.

b) Summary field

i) Composite fingerprints

The "summary" field describes the performance of the fingerprint in terms of the number of matches found with all of its n elements, the number with (n-1) elements, those with (n-2), and so on down to matches with just 2 elements. In general, true-positive hits match all n elements of the fingerprint, while sequences involving 2 elements tend to be random matches with different motifs, i.e. are usually noise. A good fingerprint exhibits a clear discrimination cut-off - i.e. shows all true positives matching with all n motifs, perhaps some noise, and few or no matches at intermediate positions of the summary table. Below is shown the summary information for the KRINGLE entry:

SUMMARY INFORMATION
   45 codes involving  4 elements
    2 codes involving  3 elements
    1 codes involving  2 elements

Here, 45 true-positive matches are found, i.e. 45 sequences have matched with all 4 fingerprint elements.
Only 1 match appears at the level of "noise", which in fact turns out to be a related sequence, and 2 other sequences make partial matches with the fingerprint (i.e. match only 3 of the 4 motifs), both of which are also related sequences.

ii) Simple fingerprints

The "summary" field for simple fingerprints uses the same format as that for composite fingerprints, but the information content is subtly different. Rather than listing the number of matches with all or part of the fingerprint, here the number of occurrences of the motif is provided:-

SUMMARY INFORMATION
   26 codes involving this element
    1 code  involving 3 occurrences
    7 codes involving 2 occurrences
   18 codes involving 1 occurrence

Thus, the "summary" field for HELIX1N indicates that there were 26 matches with the motif, of which 1 involved 3 occurrences, 7 involved 2 occurrences, and 18 involved just a single occurrence.

c) Discrimination indices

i) Composite Fingerprint Index

The Composite Fingerprint Index (CFI) is an analysis of the hitlists produced by scanning the database with n separate motifs: it expands the information provided by the summary table by describing the performance of each of the individual motifs - this is useful because it allows us to see at a glance if there is a disparity in their diagnostic powers.
The information is presented as a table, along the base of which run each of the n motif hitlists, and up the side of which run the number of motifs matched:-

COMPOSITE FINGERPRINT INDEX

 4|  45  45  45  45
 3|   2   2   2   0
 2|   1   0   1   0
--+---------------------
  |   1   2   3   4

In the example shown, continuing with the KRINGLE entry, the 45 true-positive matches shown in the summary table, by definition, are seen to be common to all 4 hitlists; the two sequences that match just 3 fingerprint elements only match the first 3 motifs (suggesting either that the sequences are fragments, or that motif 4 is not diagnostic of all sequences in the family); and the sequence matching just 2 elements is not identified by the second and fourth hitlists (this is less likely to be a fragment, so motif 2 may be less diagnostic than motifs 1 and 3). An overview of the table indicates that all elements of the fingerprint have performed more or less equally well, but motifs 2 and particularly 4 have made fewer additional matches and are therefore seemingly less diagnostic than motifs 1 and 3. None picks up any noise.

ii) Simple Fingerprint Index

The Simple Fingerprint Index gives information concerning the derivation of the fingerprint: for each iteration made, it details the number of true-positive matches appearing in the hitlist before the first false-positive, together with the calculated ROC value - the ROC value is a quantitative estimation of discriminating power, where unity indicates perfect discrimination (see ADSP documentation).

SIMPLE FINGERPRINT INDEX

 0.406  0.955  0.988  0.990   (n = 0.07)
     7     11     11     11   (tp -> fp)
 --------------------------
     1      2      3      4   (iteration)

For HELIX1N, shown here, 7 true-positive (tp) matches were found before the first false-positive (fp) at the first iteration, giving a ROC value of 0.406.
After the fourth iteration, however, 11 true-positive matches were found before the first false-positive and the calculated ROC value has increased to 0.990, indicating a substantial increase in discriminating power. Note that n is a parameter used to calculate the ROC value - it represents the ratio of the fraction of false-positives to true-positives at a particular scoring interval within the hitlist.

d) True and false matches, and subfamilies

i) Composite fingerprints

The field immediately following the Composite Fingerprint Index is provided to identify the actual sequences matched by the fingerprint. The matches are summarised in the form of a list of all true-positive codes, followed by false-positive, true-negative and false-negative codes, if there are any. Beneath this are sequences making only partial matches, representing potential subfamilies - i.e. here are listed subfamily true-positives and false-positives, again if there are any. In some cases, sequences appearing in the subfamily fields are simply fragments, but for the sake of completeness these are still documented (a note is normally included in the general description to take account of such cases). In the KRINGLE entry we find:-

True positives..
PLMN_HUMAN  PLMN_MACMU  PLHU        PLMN_PIG    PLMN_BOVIN
S01678      APOA_HUMAN  UROT_RAT    APOA_MACMU  UROT_DESRO
JS0598      JS0599      A40522      UROK_HUMAN  SYNMUTUPA
A32974      JS0600      UROT_MOUSE  MUSHGF1     MUSHGF2
HGF_RAT     PLMN_MOUSE  HGF_HUMAN   S06794      UROK_PAPCY
UROK_PIG    UROT_HUMAN  S18932      S24604      JS0597
HUMUKPPE    FA12_HUMAN  UROK_MOUSE  N$1TPKA     N$1TPKB
N$1TPKC     THRB_BOVIN  BOVTHBNM    HGFL_MOUSE  THRB_HUMAN
HGFL_HUMAN  THRB_MOUSE  THRB_RAT    PLMN_CANFA  PLMN_HORSE

Subfamily: Codes involving 3 elements
Subfamily True positives..
HUMROR1A    HUMROR2A

Subfamily: Codes involving 2 elements
Subfamily True positives..
UROK_CHICK

Here we have 45 true-positive codes, and no false-positives. Directly beneath these are the subfamily true-positives, and again no false-positives.
The subfamily true-positives include HUMROR1A and HUMROR2A, which match 3 fingerprint elements; and UROK_CHICK, which matches just 2. The general description tells us that none of these is a fragment: HUMROR1A and HUMROR2A both make only weak matches with motif 4, and UROK_CHICK fails to match motifs 2 and 4. The sequences thus appear to represent special cases.

ii) Simple fingerprints

The field immediately following the Simple Fingerprint Index is identical to that following the Composite Fingerprint Index: i.e., it is simply a list of the actual true-positives, false-positives, etc. For HELIX1N we find:-

True positives..
1CPV   1ABP   3CYT1  3C2C   155C   451C   1LZM   7LYZ   2SNS
3TLN   1SBT   2ACT   8PAP   3PGM   1TIM1  1GPD1  2GRS   1MBN
1ECD   2MHB1  2MHB2  2LHB   2LH3   5PTI   2ADK   1RHD

In this instance, 26 true-positive codes are provided: no false-positives are listed because, in this case, they are too numerous to be informative.

e) Protein titles

i) Composite fingerprints

Following the list of database codes that match the fingerprint, a field is provided in which the corresponding database titles are listed. For the KRINGLE fingerprint, the titles appear as follows:

PROTEIN TITLES
PLMN_HUMAN  PLASMINOGEN PRECURSOR (EC - HOMO SAPIENS (HUM
PLMN_MACMU  PLASMINOGEN PRECURSOR (EC - MACACA MULATTA (R
PLHU        Plasmin (EC precursor - Human
PLMN_PIG    PLASMINOGEN (EC - SUS SCROFA (PIG).
PLMN_BOVIN  PLASMINOGEN (EC - BOS TAURUS (BOVINE).
S01678      Plasminogen activator (EC precursor, tissue
APOA_HUMAN  APOLIPOPROTEIN(A) PRECURSOR (EC 3.4.21.-) (APO(A))
UROT_RAT    TISSUE PLASMINOGEN ACTIVATOR PRECURSOR (EC
APOA_MACMU  APOLIPOPROTEIN(A) (EC 3.4.21.-) (APO(A)) (LP(A)) (FRAGME
UROT_DESRO  SALIVARY PLASMINOGEN ACTIVATOR PRECURSOR (EC
JS0598      Plasminogen activator (EC alpha-2 precursor -
JS0599      Plasminogen activator (EC beta precursor -
A40522      *Plasminogen - Rat (fragment)
UROK_HUMAN  UROKINASE-TYPE PLASMINOGEN ACTIVATOR PRECURSOR (EC3.4.21
SYNMUTUPA   SYNMUTUPA mutated urokinase-type plasminogen activator -
A32974      *Plasminogen activator (EC precursor,urokinas
JS0600      Plasminogen activator (EC gamma precursor -
UROT_MOUSE  TISSUE PLASMINOGEN ACTIVATOR PRECURSOR (EC
MUSHGF1     MUSHGF1 LOCUS MUSHGF1 2204 bp ss-mRNA ROD 28-OCT-1992 -
MUSHGF2     MUSHGF2 LOCUS MUSHGF2 2189 bp ss-mRNA ROD 28-OCT-1992 -
HGF_RAT     HEPATOCYTE GROWTH FACTOR PRECURSOR (HGF). - RATTUS NORVE
PLMN_MOUSE  PLASMINOGEN PRECURSOR (EC - MUS MUSCULUS (MOU
HGF_HUMAN   HEPATOCYTE GROWTH FACTOR PRECURSOR (SCATTER FACTOR) (SF)
S06794      Hepatocyte growth factor precursor - Human
UROK_PAPCY  UROKINASE-TYPE PLASMINOGEN ACTIVATOR PRECURSOR (EC3.4.21
UROK_PIG    UROKINASE-TYPE PLASMINOGEN ACTIVATOR PRECURSOR (EC3.4.21
UROT_HUMAN  TISSUE PLASMINOGEN ACTIVATOR PRECURSOR (EC
S18932      *Urikinase-type plasminogen activator - Rat
S24604      *Urinary plasminogen activator - Rat
JS0597      Plasminogen activator (EC alpha-1 precursor
HUMUKPPE    HUMUKPPE preprourokinase; (EC - Homo sapiens
FA12_HUMAN  COAGULATION FACTOR XII PRECURSOR (EC (HAGEMAN
UROK_MOUSE  UROKINASE-TYPE PLASMINOGEN ACTIVATOR PRECURSOR (EC3.4.21
N$1TPKA     Plasminogen activator (EC (kringle 2 domain)
N$1TPKB     Plasminogen activator (EC (kringle 2 domain)
N$1TPKC     Plasminogen activator (EC (kringle 2 domain)
THRB_BOVIN  PROTHROMBIN PRECURSOR (EC - BOS TAURUS (BOVIN
BOVTHBNM    BOVTHBNM preprothrombin - Bos taurus
HGFL_MOUSE  HEPATOCYTE GROWTH FACTOR-LIKE PROTEIN PRECURSOR. - MUS
THRB_HUMAN  PROTHROMBIN PRECURSOR (EC (COAGULATION FACTOR
HGFL_HUMAN  HEPATOCYTE GROWTH FACTOR-LIKE PROTEIN PRECURSOR. - HOMO
THRB_MOUSE  PROTHROMBIN PRECURSOR (EC - MUS MUSCULUS (MOU
THRB_RAT    PROTHROMBIN PRECURSOR (EC - RATTUS NORVEGICUS
PLMN_CANFA  PLASMINOGEN (EC (FRAGMENT). - CANIS FAMILIARIS
PLMN_HORSE  PLASMINOGEN (EC (FRAGMENT). - EQUUS CABALLUS

HUMROR1A    HUMROR1A contains tyrosine kinase-like domain - Homo sap
HUMROR2A    HUMROR2A contains tyrosine kinase-like domain - Homo sap

UROK_CHICK  UROKINASE-TYPE PLASMINOGEN ACTIVATOR PRECURSOR (EC3.4.21

Thus we have 45 true-positive titles, followed by 3 "subfamily" titles (delineated by blank lines).

ii) Simple fingerprints

Just as for composite fingerprints, a list of all true-positive titles is provided for simple fingerprints - the format in each case is identical. E.g.:-

PROTEIN TITLES
1CPV  Calcium-binding Parvalbumin B - Carp (Cyprinus Carpio)
1ABP  L-arabinose-binding Protein - (Escherichia Coli)
3CYT  Cytochrome C (Oxidized) - Albacore Tuna (Thunnus Alalunga)
3C2C  Cytochrome C2 (Reduced) - (Rhodospirillum Rubrum)
155C  Cytochrome C550 - (Paracoccus Denitrificans)
451C  Cytochrome $C551 (Reduced) - (Pseudomonas Aeruginosa)
1LZM  Lysozyme - Bacteriophage T4
7LYZ  Lysozyme - Hen (Gallus Gallus) Egg White
2SNS  Staphylococcal Nuclease - (Staphylococcus Aureus)
3TLN  Thermolysin - (Bacillus Thermoproteolyticus)
1SBT  Subtilisin - Probably Bacillus Amyloliquefaciens
2ACT  Actinidin - Chinese Gooseberry
8PAP  Papain - Papaya (Carica Papaya) Fruit Latex
3PGM  Phosphoglycerate Mutase - (Saccharomyces Cerevisiae
1TIM  Triose Phosphate Isomerase - Chicken (Gallus Gallus) Breas
1GPD  D-gyceraldehyde-3-phosphate Dehydrogenase - Lobster (Homar
2GRS  Glutathione Reductase - Human (Homo Sapiens) Erythrocyte
1MBN  Myoglobin (Ferric Iron - Metmyoglobin) - Sperm Whale (Phy
1ECD  Hemoglobin (Erythrocruorin, Deoxy) - (Chironomous Thummi T
2MHB  Hemoglobin (Aquo, Met) - Horse (Equus Caballus)
2LHB  Hemoglobin V (Cyano, Met) - Sea lamprey (Petromyzon marinu
2LH3  Leghemoglobin (Cyano, Met) - Yellow Lupin (Lupinus Luteus
5PTI  Trypsin Inhibitor (Crystal Form II) - Bovine (Bos Taurus)
2ADK  Adenylate Kinase - Porcine (Sus Scrofa) Muscle
1RHD  Rhodanese - Bovine (Bos Taurus) Liver

This example shows the titles of all the true-positive sequences found by entry HELIX1N: there are 26 true-positives but 25 titles because in one instance the motif occurs in 2 different chains of the same protein (2MHB).

f) Scan history

i) Composite fingerprints

For the purpose of documenting the evolution of a fingerprint, it is necessary to know the database on which the fingerprint was originally derived (i.e. the convergence database), in addition to the database, or databases, on which it has subsequently been refined (i.e. the update database/s). This information is contained within the "scan history", which also includes the number of iterations performed on each database; the length of hitlist used in the final analysis (i.e. that led to the actual database entry) - this is not necessarily the same as the working hitlist length, which may be many times longer; and the database scanning method used to derive the fingerprint. For example:-

SCAN HISTORY
OWL11_0   4   500   NSINGLE
OWL17_1   1   250   NSINGLE
OWL18_0   1   150   NSINGLE
OWL19_1   1   150   NSINGLE

The scan history for KRINGLE indicates that this is a mature fingerprint, converging originally after 4 iterations on OWL11.0, and refined on 3 subsequent releases of OWL. The NSINGLE scanning method was used throughout, and the analysis hitlist length has been reduced to 150 as the fingerprint has become more potent.

ii) Simple fingerprints

The purpose and format of the scan history are the same as for composite fingerprints. For example:-

SCAN HISTORY
KAS6089   4   100   NPAIR

Thus, for HELIX1N the KAS6089 database was searched with the NPAIR scanning method, 4 iterations were performed until convergence, and the hitlist length was 100.
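
Because each scan-history record has the same four columns (database, iterations, hitlist length, scanning method), the field is easy to read programmatically. The following Python sketch is an illustration only and is not part of SMITE or ADSP; the names ScanRecord and parse_scan_history are hypothetical, and the parser assumes whitespace-separated records exactly as in the examples above.

```python
from typing import List, NamedTuple


class ScanRecord(NamedTuple):
    database: str        # e.g. "OWL11_0"
    iterations: int      # iterations performed on this database
    hitlist_length: int  # hitlist length used in the final analysis
    method: str          # scanning method, e.g. "NSINGLE" or "NPAIR"


def parse_scan_history(text: str) -> List[ScanRecord]:
    """Parse the records of a "scan history" field.

    Lines that do not have exactly four whitespace-separated columns
    (the "SCAN HISTORY" heading, blank lines) are skipped.
    """
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue
        database, iterations, length, method = fields
        records.append(ScanRecord(database, int(iterations), int(length), method))
    return records


kringle = """SCAN HISTORY
OWL11_0   4   500   NSINGLE
OWL17_1   1   250   NSINGLE
OWL18_0   1   150   NSINGLE
OWL19_1   1   150   NSINGLE"""

records = parse_scan_history(kringle)
total_iterations = sum(r.iterations for r in records)
```

For the KRINGLE example the iterations sum to 7 (4 + 1 + 1 + 1), which agrees with the iteration number carried in the title lines of the final KRINGLE motifs.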
g) Initial motif-sets

i) Composite fingerprints

To illustrate the starting point in the derivation of a fingerprint, the motifs excised from the initial alignment are provided in the "initial motif-set" field. Each motif is assigned a code, which is the general code plus the number of that motif; provided next are its length and a short title, which also indicates an iteration number (for initial motifs this is always 1). The motifs themselves follow, together with the protein identification codes of the initial sequences (PCODE), the location of the motifs within those sequences (ST), and the interval between adjacent motifs (INT) - for the first motif, this is simply the distance from the beginning of the sequence to the start of the motif. The initial motif-sets for KRINGLE are depicted below (for convenience, only the first and last are actually shown):-

INITIAL MOTIF SETS

KRINGLE1  Length of motif = 16  Motif number = 1
Kringle domain motif I - 1
                  PCODE        ST   INT
CYEDQGISYRGTWSTA  UKHUT       127   127
CYEDQGISYRGTWSTA  HUMPAR      127   127
CYEDQGISYRGTWSTA  HUMTPAR      81    81
CYHGDGQSYRGTSSTT  PLMN$HUMAN  377   377
CYHGDGQSYRGTSSTT  PLMN$MACMU  377   377
CYEGNGHFYRGKASTD  EZEC572      50    50
CYDGRGLSYRGLARTT  !F12HA      198   198
.
.
.
KRINGLE4  Length of motif = 12  Motif number = 4
Kringle domain motif IV - 1
              PCODE        ST   INT
KYSSEFCSTPAC  UKHUT       197     4
KYSSEFCSTPAC  HUMPAR      197     4
KYSSEFCSTPAC  HUMTPAR     151     4
SVRWEYCNLKKC  PLMN$HUMAN  443     3
SVRWEYCNLKKC  PLMN$MACMU  443     3
KPLVQECMVHDC  EZEC572     120     4
RLSWEYCDLAQC  !F12HA      265     4

In this example, it can be seen that the initial alignment contained 7 sequences. The fingerprint comprises 4 elements, designated KRINGLE1 to KRINGLE4, each of which encodes part of the kringle domain.
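
The row layout of a motif-set (motif, PCODE, ST, INT) lends itself to simple consistency checks. The Python sketch below is a hypothetical helper, not part of any PRINTS software: MotifRow, parse_motif_rows and check_first_motif are assumed names. It relies only on two facts stated above: every motif in a set has the declared length, and for the first motif INT equals ST. No assumption is made about how INT is computed for later motifs.

```python
from typing import List, NamedTuple, Optional


class MotifRow(NamedTuple):
    motif: str
    pcode: str
    start: int               # ST column
    interval: Optional[int]  # INT column; None when left blank


def parse_motif_rows(text: str) -> List[MotifRow]:
    """Parse data rows of a motif-set, skipping heading and title lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # A data row has 3 or 4 columns and a numeric ST in the third.
        if len(fields) not in (3, 4) or not fields[2].isdigit():
            continue
        interval = int(fields[3]) if len(fields) == 4 else None
        rows.append(MotifRow(fields[0], fields[1], int(fields[2]), interval))
    return rows


def check_first_motif(rows: List[MotifRow], declared_length: int) -> List[str]:
    """Return a list of problems found in the rows of a FIRST motif."""
    problems = []
    for row in rows:
        if len(row.motif) != declared_length:
            problems.append(f"{row.pcode}: motif length {len(row.motif)}")
        if row.interval is not None and row.interval != row.start:
            problems.append(f"{row.pcode}: INT {row.interval} != ST {row.start}")
    return problems


kringle1 = """Kringle domain motif I - 1
PCODE ST INT
CYEDQGISYRGTWSTA UKHUT 127 127
CYEDQGISYRGTWSTA HUMTPAR 81 81
CYHGDGQSYRGTSSTT PLMN$HUMAN 377 377
CYDGRGLSYRGLARTT !F12HA 198 198"""

rows = parse_motif_rows(kringle1)
problems = check_first_motif(rows, declared_length=16)
```

Rows with a blank INT column, as used for simple fingerprints, are parsed with interval set to None and are exempt from the INT check.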
ii) Simple fingerprints

The initial motifs for simple fingerprints follow exactly the same format as for composite fingerprints, with one minor but obvious difference - namely that, by virtue of being single-component fingerprints, they contain only one motif, and the interval between motifs is therefore irrelevant. The following example shows part of the "initial motif-set" field for HELIX1N: the "INT" parameter is left blank.

INITIAL MOTIF SETS

HELIX1N1  Length of motif = 9  Motif number = 1
Alpha-helix N-terminal type I - 1
           PCODE    ST   INT
AADMQGVVT  1AZU     50
AADNATAIA  1HIP      8
AAGDAGFEK  2LHB    128
ADAHFPVVK  2LH3    103
AEGSVDDVF  2ADK    175
.
.
.
.

h) Final motif-sets

i) Composite fingerprints

The "final motif-set" field contains the last and most important piece of information: it contains the final motifs of the converged, refined fingerprint. The field adheres to the same format as the "initial motif-set" field, the only difference being that it tends to contain more mature motifs. The final motifs for KRINGLE are shown below (again, only the first and last are shown):-

FINAL MOTIF SETS

KRINGLE1  Length of motif = 16  Motif number = 1
Kringle domain motif I - 7
                  PCODE         ST   INT
CYHGNGQSYRGTSSTT  PLMN_BOVIN   358   358
CYHGNGQSYRGTYSTT  APOA_HUMAN   142   142
CYHGDGQSYRGTSSTT  APOA_MACMU   954   954
CYHGDGQSYRGTSSTT  PLMN_HUMAN   377   377
CYHGDGQSYRGTSSTT  PLMN_MACMU   377   377
CYHGDGQSYRGTSSTT  PLHU         357   357
CYRGNGESYRGTSSTT  PLMN_PIG     358   358
CYQGNGKSYRGTSSTT  A40522        34    34
CYEGQGVTYRGTWSTA  JS0597       128   128
CYEDQGISYRGTWSTA  UROT_HUMAN   127   127
CYEDQGISYRGTWSTA  S01678       127   127
CYEGNGHFYRGKASTD  UROK_HUMAN    70    70
CYEGNGHFYRGKASTD  UROK_PAPCY    69    69
CYEGNGHFYRGKASTD  HUMUKPPE      70    70
CYEGNGHFYRGKASTD  SYNMUTUPA     69    69
CYQSDGQSYRGTSSTT  PLMN_MOUSE   377   377
CFEGQGITYRGTWSTA  UROT_RAT     124   124
CYEGNGMFYRGKASTD  A32974        50    50
CYHGNGQSYRGKANTD  S18932        70    70
CYHGNGQSYRGKANTD  S24604        70    70
CYHGNGDSYRGKANTD  UROK_MOUSE    71    71
CYKDQGVTYRGTWSTS  UROT_DESRO   128   128
CYKDQGVTYRGTWSTS  JS0598       128   128
CYKDQGVTYRGTWSTS  JS0599        82    82
CYKDQGVTYRGTWSTS  JS0600        45    45
CFEEQGITYRGTWSTA  UROT_MOUSE   124   124
CFRGKGEGYRGTANTT  HGFL_HUMAN   283   283
CFEGNGHSYRGKANTN  UROK_PIG      72    72
CYDGRGLSYRGLARTT  FA12_HUMAN   217   217
CIIGKGGSYKGTVSIT  HGF_RAT      129   129
CIIGKGGSYKGTVSIT  MUSHGF1      129   129
CIIGKGGSYKGTVSIT  MUSHGF2      129   129
CIIGKGRSYKGTVSIT  HGF_HUMAN    128   128
CIIGKGRSYKGTVSIT  S06794       128   128
CFRGKGEDYRGTTNTT  HGFL_MOUSE   292   292
CAEGVGMNYRGNVSVT  THRB_BOVIN   109   109
CAEGVGMNYRGNVSVT  BOVTHBNM     109   109
CMFGNGKGYRGKKATT  PLMN_CANFA     4     4
CAEGLGTNYRGHVNIT  THRB_HUMAN   108   108
CMLGIGKGYQGKKATT  PLMN_HORSE     9     9
CAMDLGVNYLGTVNVT  THRB_MOUSE   109   109
CAMDLGLNYHGNVSVT  THRB_RAT     109   109
CYFGNGSAYRGTHSLT  N$1TPKA        5     5
CYFGNGSAYRGTHSLT  N$1TPKB        5     5
CYFGNGSAYRGTHSLT  N$1TPKC        5     5
.
.
.
.
.
KRINGLE4  Length of motif = 12  Motif number = 4
Kringle domain motif IV - 7
              PCODE         ST   INT
RVRWEFCNLKKC  PLMN_BOVIN   424     3
SIRWEYCNLTRC  APOA_HUMAN  4190     3
SVRREYCNLTRC  APOA_MACMU  1134     3
SVRWEYCNLKKC  PLMN_HUMAN   443     3
SVRWEYCNLKKC  PLMN_MACMU   443     3
SVRWEYCNLKKC  PLHU         423     3
RVRWEYCNLKKC  PLMN_PIG     424     3
SVRWEYCNLKRC  A40522       101     4
KFTSESCSVPVC  JS0597       198     4
RLTWEYCDVPSC  UROT_HUMAN   285     4
KYSSEFCSTPAC  S01678       197     4
KPLVQECMVHDC  UROK_HUMAN   140     4
KQRVQECMVHNC  UROK_PAPCY   139     4
KPLVQECMVHDW  HUMUKPPE     140     4
KPLVQECMVHDC  SYNMUTUPA    139     4
SVRWEYCNLKRC  PLMN_MOUSE   443     3
KYTTEFCSTPAC  UROT_RAT     194     4
KPLVQECMVHDC  A32974       120     4
KQFVQECMVQDC  S18932       140     4
KQFVQECMVQDC  S24604       140     4
RQFVQECMVHDC  UROK_MOUSE   141     4
KFILEFCSVPVC  UROT_DESRO   198     4
KFILEFCSVPVC  JS0598       198     4
KFILEFCSVPVC  JS0599       152     4
KFTSESCSVPVC  JS0600       115     4
KYTTEFCSTPAC  UROT_MOUSE   194     4
RTPFDYCALRRC  HGFL_HUMAN   437     4
KQLVQECMVPNC  UROK_PIG     142     4
RLSWEYCDLAQC  FA12_HUMAN   284     4
DTPWEYCAIKMC  HGF_RAT      278     3
DTPWEYCAIKTC  MUSHGF1      278     3
DTPWEYCAIKTC  MUSHGF2      273     3
HTRWEYCAIKTC  HGF_HUMAN    277     3
HTRWEYCAIKTC  S06794       277     3
DILFDYCALQRC  HGFL_MOUSE   446     4
PGDFEYCDLNYC  THRB_BOVIN   281     4
PGDFEYCDLNYC  BOVTHBNM     281     4
RKLFDYCDVPQC  PLMN_CANFA    72     4
PGDFGYCDLNYC  THRB_HUMAN   280     4
QKLFDYCDVPQC  PLMN_HORSE    77     4
PGDFEYCNLNYC  THRB_MOUSE   282     4
QPGFEYCSLNYC  THRB_RAT     281     3
RLTWEYCDVPSC  N$1TPKA       75     4
RLTWEYCDVPSC  N$1TPKB       75     4
RLTWEYCDVPSC  N$1TPKC       75     4

Here, the final motifs contain sequence information from 45 sequences after iterative refinement - the title lines have been updated to indicate that this last scan represents a 7th iteration.

ii) Simple fingerprints

The "final motif-set" field for simple fingerprints is again identical to that for composite fingerprints. For example:-

FINAL MOTIF SETS

HELIX1N1  Length of motif = 9  Motif number = 1
Alpha-helix N-terminal type I - 4
           PCODE    ST   INT
AEGSVDDVF  2ADK    175
APLSAAEKT  2LHB      9
DGKMVNEAL  2SNS     95
FEKSPEELR  1RHD    221
GFIEEDELK  1CPV     56
GLTTADELK  2LHB     57
GVDAVSELS  1ABP    233
.
.
.
.
.

Here we see part of the "final motif-set" field for HELIX1N. This represents the 4th and final iteration and contains all the true-positive sequences that contribute to the refined fingerprint. Note again that the "INT" parameter is irrelevant in this context and is simply left blank.

1.5 Associated software

As mentioned at the outset, one of the principal aims in developing the PRINTS database is to aid sequence analysis. To this end, the database has been designed in such a way that entries contain all the information needed to (i) begin fingerprint refinement from scratch, using the initial motifs, and (ii) to make instant diagnoses of new sequences, using the final motifs, i.e. the refined fingerprints.

Fingerprints may be extracted from the database using SMITE, the PRINTS database query language (see SMITE documentation), and then used with ADSP's PLOT and SCAN options, for single sequence and whole database searches respectively (see ADSP documentation). Conversely, the program EXFINGER allows a query sequence to be scanned very rapidly against either the entire fingerprint database or against a specified fingerprint (if you have a terminal that supports X-Windows, simply type `exfinger' at either a VMS or SG prompt).
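
To give a feel for what scanning a query sequence with a final motif involves, the following Python sketch performs an unweighted frequency scan of the kind outlined in section 1.2. It is an assumption-laden illustration, not the algorithm used by EXFINGER or by ADSP's PLOT and SCAN options: the function names frequency_matrix and scan_sequence are hypothetical, and the score (the sum of raw residue counts at each motif position) is chosen for simplicity.

```python
from collections import Counter
from typing import List, Tuple


def frequency_matrix(motifs: List[str]) -> List[Counter]:
    """Count residue occurrences at each position of a set of aligned motifs."""
    length = len(motifs[0])
    if any(len(m) != length for m in motifs):
        raise ValueError("aligned motifs must all have the same length")
    return [Counter(m[i] for m in motifs) for i in range(length)]


def scan_sequence(sequence: str, matrix: List[Counter]) -> List[Tuple[int, int]]:
    """Score every window of the sequence against the frequency matrix.

    Returns (score, start) pairs, best first; start is 1-based, following
    the ST column of the motif-sets. Unseen residues score 0.
    """
    width = len(matrix)
    hits = []
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        score = sum(column[res] for column, res in zip(matrix, window))
        hits.append((score, offset + 1))
    return sorted(hits, reverse=True)


# Three rows taken from the final KRINGLE4 motif shown above.
motifs = ["SVRWEYCNLKKC", "RVRWEYCNLKKC", "KPLVQECMVHDC"]
matrix = frequency_matrix(motifs)

# Invented query: four padding residues, one motif occurrence, two more.
query = "AAAA" + "SVRWEYCNLKKC" + "GG"
best_score, best_start = scan_sequence(query, matrix)[0]
```

In this toy case the best window starts at residue 5 with a score of 25. For a composite fingerprint the same scan would be repeated with each motif, and a sequence would be diagnosed as a family member only if it matched all of them.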
1.6 The alignment database

The process of deriving fingerprints relies on the generation of individual sequence alignments, from which sets of locally-aligned motifs are excised for database dredging. In order to capitalise on the effort in constructing the initial alignments, we have made available an alignment compendium to companion the fingerprint resource. All alignments are in NBRF format, each named according to the corresponding PRINTS identification code, and taking file extension .seqs.

We make no claims for their "correctness" (a subjective notion at the best of times), but provide them in good faith as a guide to, or as an illustration of, the types of protein family contained in PRINTS. They may be of use to those wishing to augment the information contained in PRINTS, or they may be used as a convenient starting point for further analyses. The alignments should be accessible to any software that accepts NBRF format (e.g. SOMAP, VISTAS, XALIGN, etc.).

2.0 References

1. Attwood, T.K., Beck, M.E., Bleasby, A.J. and Parry-Smith, D.J. (1994) PRINTS - A database of protein motif fingerprints. Nucleic Acids Research, in press.

2. Attwood, T.K. and Beck, M.E. (1994) PRINTS - A protein motif fingerprint database. Protein Engineering, 7 (7), 841-848.

3. Parry-Smith, D.J. and Attwood, T.K. (1992) ADSP - A new package for computational sequence analysis. CABIOS, 8 (5), 451-459.

4. Bleasby, A.J., Akrigg, D. and Attwood, T.K. (1994) OWL - A non-redundant composite protein sequence database. Nucleic Acids Research, in press.

5. Bleasby, A.J. and Wootton, J.C. (1990) Construction of validated, non-redundant composite protein sequence databases. Protein Engineering, 3 (3), 153-159.

6. Parry-Smith, D.J. and Attwood, T.K. (1991) SOMAP - A novel interactive approach to multiple protein sequence alignment. CABIOS, 7 (2), 233-235.

7. Akrigg, D., Attwood, T.K, Bleasby, A.J., Findlay, J.B.C., Maughan, N.A., North, A.C.T., Parry-Smith, D.J., Perkins, D.N. and Wootton, J.C. (1992) SERPENT: An information storage and analysis resource for protein sequences. CABIOS, 8 (3), 295-296.

8. Perkins, D.N. and Attwood, T.K. (1994) VISTAS - A package for VIsualising STructures And Sequences of proteins. J.Mol.Graph., submitted.

2.1 Applications

1. Attwood, T.K. and Findlay, J.B.C. (1994) Fingerprinting G-Protein-Coupled Receptors. Protein Engineering, 7 (2), 195-203.

2. Attwood, T.K. and Findlay, J.B.C. (1993) Design of a discriminating fingerprint for G-protein-coupled receptors. Protein Engineering, 6 (2), 167-176.

3. Flower, D.R., North, A.C.T. and Attwood, T.K. (1993) Structure and Sequence Relationships in the Lipocalins and Related Proteins. Protein Science, 2, 753-761.

4. Boguski, M.S., Bairoch, A., Attwood, T.K. and Michaels, G.S. (1992) Proto-vav and Gene Expression. Nature, 358, 113.

5. Flower, D.R., North, A.C.T. and Attwood, T.K. (1991) Mouse oncogene protein 24p3 is a member of the Lipocalin protein family. Biochemical and Biophysical Research Communications, 180 (1), 69-74.

------ * ------