DELPHOS

1 Introduction

There are many interesting problems in information retrieval from molecular biological databases which have not previously been approached. These include:

(a) The problem of molecular biology linguistics. Difficulties of text retrieval arise because of the terminology of molecular biology and its usage. Complex queries with many parameters, perhaps including both text and sequence terms, are commonly needed.

(b) Integration of the results of sequence similarity searches and text retrieval. Similarity searches based on sequence alignment algorithms, for example those using the Lipman-Pearson method (Pearson and Lipman, 1988), frequently retrieve only a subset of the proteins of interest. It is useful if further retrieval, and removal of unwanted entries, can use textual and other parameters. Similarly, the results of textual searches can usefully be strengthened by sequence similarity searches.

(c) The creation of mini-databases for further research. Specialised subsets of the main database may be required as inputs to other programs, for example for multiple sequence alignment prior to modelling, searches by sequence alignment or pattern recognition.

These and other possible applications demand considerable flexibility in both the query language and the means of controlling information flow. So far none of the standard available systems possesses the combination of properties required in molecular biology. Relational databases have the necessary query flexibility through languages such as SQL but lack facilities for character string searching of long sequences and strings of text, and for efficient integration with the scientific search methods. Other information retrieval packages offer relational text queries but fall down on integration and use inappropriate text indexing. Integration could be achieved using the list processing facilities of high level languages such as LISP, but retrieval would be slow.
When DELPHOS was written the only available retrieval software designed specifically for sequence databases was the NBRF PSQ system (George et al., 1986), which is suitable only for simple queries and uses a very slow sequential scan method for text retrieval. Because of the inadequacies of existing software systems an integrated query and retrieval system has been designed and implemented de novo. The DELPHOS system combines some of the characteristics of relational database query methods, information retrieval systems and list processing languages, in a way which is appropriate for the manipulation of protein sequence data and related information for research in protein engineering and molecular biology.

DELPHOS is a robust system providing rapid retrieval and manipulation of data from protein sequence databases. It has a flexible query language for retrieval from both textual and sequence areas of a database. Queries of any degree of complexity can be constructed and the results integrated by list manipulations. Results of such queries can also be integrated with the hit lists derived from sequence similarity searches using SWEEP. Entries retrieved by these search methods can readily be saved as disc files of specialised mini-databases to facilitate further research.

The index files, used for both sequence and text retrieval, permit simple, flexible parameter specification and ensure very rapid data retrieval. Simple queries take only a few seconds of cpu time on a VAX minicomputer. Highly complex queries involving relational operations and list integrations rarely take more than a minute. Only fuzzy sequence searches are comparatively slow, as these have to be performed by a serial scan of the database.

This combination of properties makes DELPHOS a useful tool for increasing the accessibility of data and knowledge relating to protein sequences.
It can be employed in a wide range of applications, from exploratory browsing of the database using simple text and sequence strings, to creation of specialised subsets of the database with very high precision and completeness by integrating sequence similarity searches and text retrieval. It can also be used to assist analysis of terminology in molecular biology and thesaurus construction.

2 Using DELPHOS

The DELPHOS language is easy to understand, but formal descriptions of query languages tend to be rather difficult to absorb. For this reason the formal description is deferred to later in this document and a tutorial-style introduction is given first to show how simple the query language is to use. A practical example of the use of DELPHOS is given later.

2.1 How to enter DELPHOS

This depends on how your computer manager has set the system up. There will generally be site-specific commands you need to type to run DELPHOS; ask your system manager what these are or refer to local documentation. When DELPHOS is invoked it displays a summary of the current version of the OWL database and then presents you with the DELPHOS> prompt. In the rest of this documentation all queries are shown with this prompt; it is reproduced solely for clarity and you do not need to type it. The examples in this documentation are based on a test sequence database.

2.2 A sample sequence search

Let's take the search for a directly matching peptide as our first example. The SEQ function is used for this. Type

DELPHOS> display seq "sgksir"

(fig 1)

Figure No.1

WORKLIST ENTRIES (14):
DEHUAA      Alcohol dehydrogenase (EC 1.1.1.1) alpha chain - Human
DEHUAB      Alcohol dehydrogenase (EC 1.1.1.1) beta-1 chain - Human
DEHUAG      Alcohol dehydrogenase (EC 1.1.1.1) gamma-1 chain - Human
DEMSAA      Alcohol dehydrogenase (EC 1.1.1.1) A chain - Mouse
ADH_PAPHA   ALCOHOL DEHYDROGENASE (EC 1.1.1.1) (ADH). - BABOON (PAPIO HAMADRYAS).
ADHG_HUMAN  ALCOHOL DEHYDROGENASE GAMMA CHAIN (EC 1.1.1.1) (GENE NAME: ADH3). - HUMAN (HOMO SAPIENS).
ADHS_HORSE  ALCOHOL DEHYDROGENASE S CHAIN (EC 1.1.1.1). - HORSE (EQUUS CABALLUS).
ADHX_HUMAN  ALCOHOL DEHYDROGENASE CLASS III CHI CHAIN (EC 1.1.1.1) (GENE NAME: ADH5). - HUMAN (HOMO SAPIENS).
HUMADH21C   HUMADH21C Human class I alcohol dehydrogenase beta-1 subunit, allele 1 mRNA, complete cds. - Homo sapiens Eukaryota
HUMADH2BA   HUMADH2BA Human class I alcohol dehydrogenase (ADH2) beta-1 subunit mRNA, complete cds. - Homo sapiens Eukaryota
HUMADH2C2   HUMADH2C2 Human class I alcohol dehydrogenase (ADH2) beta-1 subunit mRNA, complete cds. - Homo sapiens Eukaryota
HUMADH3G2   HUMADH3G2 Human class I alcohol dehydrogenase (ADH3) gamma subunit, allele 2 mRNA, complete cds. - Homo sapiens Eukaryota
HUMADH5C3   HUMADH5C3 Human alcohol dehydrogenase class III (ADH5) mRNA, complete cds. - Homo sapiens Eukaryota
ADHB_HUMAN  ALCOHOL DEHYDROGENASE BETA CHAIN (EC 1.1.1.1). - HOMO SAPIENS (HUMAN).

This query finds and gives brief information about all sequences in the database which contain the hexapeptide ser-gly-lys-ser-ile-arg. The first word `display' is the `command' used in DELPHOS to say "show on the screen the results of the following query". The second word `seq' is the start of the query and is the `function' used to find exact sequence matches. Any function like `seq' is always followed by a parameter which tells DELPHOS what you want the function to find; in the above case this is the peptide `sgksir'. Parameter strings are always bounded by double quotes.

The query given above produces the output shown in figure 1. At the top it says there are 14 entries in the `worklist'. Lists will be described more fully later. In the meantime the worklist can be regarded as a store cupboard within DELPHOS where the results of your query are kept.
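
As a rough mental model of an exact sequence search and of the worklist, the following Python sketch may help. It is not DELPHOS code. The toy database, its pcodes and sequences, and the names `spell_out', `seq' and `Session' are all invented for illustration. DELPHOS uses index files for this kind of search, whereas the sketch simply scans every entry. Folding a lower-case probe to upper case is an assumption made here so that a probe typed as "sgksir" matches.

```python
# Toy stand-in for a sequence database: pcode -> (title, sequence).
# Codes, titles and sequences are invented for illustration only.
TOY_DB = {
    "TOY1": ("Toy protein one", "MAASGKSIRLLK"),
    "TOY2": ("Toy protein two", "MKTAYIAKQR"),
    "TOY3": ("Toy protein three", "GGSGKSIRAA"),
}

# Standard one-letter to three-letter amino acid codes.
THREE_LETTER = {
    "A": "ala", "R": "arg", "N": "asn", "D": "asp", "C": "cys",
    "Q": "gln", "E": "glu", "G": "gly", "H": "his", "I": "ile",
    "L": "leu", "K": "lys", "M": "met", "F": "phe", "P": "pro",
    "S": "ser", "T": "thr", "W": "trp", "Y": "tyr", "V": "val",
}


def spell_out(peptide):
    """Write a one-letter peptide in hyphenated three-letter form."""
    return "-".join(THREE_LETTER[residue] for residue in peptide.upper())


def seq(db, peptide):
    """Return the pcodes whose sequence contains the peptide exactly."""
    probe = peptide.upper()  # assumption: probes are case-folded
    return [pcode for pcode, (_title, sequence) in db.items()
            if probe in sequence]


class Session:
    """Holds a worklist, the 'store cupboard' for the latest results."""

    def __init__(self, db):
        self.db = db
        self.worklist = []

    def display(self, pcodes=None):
        """Store new results if given, then list pcode and title lines."""
        if pcodes is not None:
            self.worklist = list(pcodes)
        lines = [f"WORKLIST ENTRIES ({len(self.worklist)}):"]
        lines += [f"{p} {self.db[p][0]}" for p in self.worklist]
        return "\n".join(lines)


if __name__ == "__main__":
    session = Session(TOY_DB)
    print(spell_out("sgksir"))
    print(session.display(seq(TOY_DB, "sgksir")))
```

In this sketch the matching pcodes stay in `worklist' after the call, so they remain available without repeating the search.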
Because the results are kept in the worklist, if you want to see them again all you need do is type

DELPHOS> display

Typing `display' on its own always shows what is in the worklist. Try it.

What does the output mean? By default the `display' command shows only brief information about proteins which match your query. This information consists of the protein code (pcode) and a summary line. Every sequence in the database must have a unique identifying name; this name is the pcode. For example, the pcode for ovine opsin is `OOSH'; similarly, the pcode for Factor VIII is `EZHU'. Whenever you want to refer explicitly to a particular protein you use its pcode. The summary line, also called the title line, tells you the English name of the protein and often tells you the source, e.g. `Cytochrome C - Human'. DELPHOS does, of course, allow you to display protein information in detail. How to display sequences and bibliographic information is described later.

2.3 A sample title search

Something you'll frequently want to do is to find a protein, or group of proteins, in the database given its name. There are two ways of doing this in DELPHOS. The first uses the `title' function; the second, which uses the `text' function, is described in the next section. Suppose you want to find all opsins in the database. Type

DELPHOS> display title "opsin"

(fig 2)

Figure No.2

WORKLIST ENTRIES (11):
OOBO        Rhodopsin - Bovine
OOFF        Rhodopsin - Fruit fly
OOFF2       Opsin 2 - Fruit fly
OOHUB       Blue-sensitive opsin - Human
OOHUR       Red-sensitive opsin - Human
OOHUG       Green-sensitive opsin - Human
OPS3_DROME  OPSIN RH3 (INNER R7 PHOTORECEPTOR CELLS OPSIN) (GENE NAME: RH3 OR RH92CD). - FRUIT FLY (DROSOPHILA MELANOGASTER).
OPS4_DROME  OPSIN RH4 (INNER R7 PHOTORECEPTOR CELLS OPSIN) (GENE NAME: RH4). - FRUIT FLY (DROSOPHILA MELANOGASTER).
OPSD_HUMAN  RHODOPSIN. - HUMAN (HOMO SAPIENS).
OPSD_MOUSE  RHODOPSIN. - MOUSE (MUS MUSCULUS).
OPSD_OCTDO  RHODOPSIN. - GIANT OCTOPUS (OCTOPUS DOFLEINI).

The output is shown in figure 2.
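
The listing in figure 2 includes both `Rhodopsin' and `OPSIN RH3', so the probe evidently matches inside longer words and regardless of case. A minimal Python sketch of that behaviour is given here, assuming a plain dictionary of title lines. The function name `title_search' and the dictionary are hypothetical and are not part of DELPHOS; the title lines themselves are copied from figures 1 and 2.

```python
# pcode -> title line, copied from figures 1 and 2 of this document.
TITLES = {
    "OOBO": "Rhodopsin - Bovine",
    "OOHUB": "Blue-sensitive opsin - Human",
    "OPS3_DROME": "OPSIN RH3 (INNER R7 PHOTORECEPTOR CELLS OPSIN) "
                  "(GENE NAME: RH3 OR RH92CD). - FRUIT FLY "
                  "(DROSOPHILA MELANOGASTER).",
    "DEHUAA": "Alcohol dehydrogenase (EC 1.1.1.1) alpha chain - Human",
}


def title_search(titles, probe):
    """Return pcodes whose title line contains the probe, ignoring case."""
    wanted = probe.lower()
    return [pcode for pcode, line in titles.items()
            if wanted in line.lower()]


if __name__ == "__main__":
    for pcode in title_search(TITLES, "opsin"):
        print(pcode, TITLES[pcode])
```

Only the title lines are examined in this sketch, so an entry is found only when the probe occurs in its summary line.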
The title function searches all the title (summary) lines for any occurrences of the character string `opsin'. The search does not differentiate between upper and lower case letters, so the query above will find, for example, both `OPSIN' and `oPsIn'. Note that the database text describing a protein may contain the word `opsin', but if the word doesn't appear in the title line the `title' function will not find it.

DELPHOS treats textual information in a special way. It allows you to search for any alphanumeric characters (A-Z, a-z and 0-9) but ignores all other characters (e.g. % & - $ etc). It even ignores any space characters within the database. This is called `free text searching' and has many advantages. As an example type

DELPHOS> display title "p450"

(fig 3)

Figure No.3

WORKLIST ENTRIES (10):
O4HU6       Cytochrome P450IA1 - Human (fragment)
O4RBM4      Cytochrome P450IA2, isosafrole-inducible - Rabbit (fragments)
CP41_RAT    CYTOCHROME P450 IVA1 (P-450-LA-OMEGA) (LAURIC ACID OMEGA-HYDROXYLASE) (EC 1.14.15.3) (P452) (GENE NAME: CYP4A1). - RAT (RATTUS NORVEGICUS).
CPAX_HUMAN  CYTOCHROME P450 IIA (EC 1.14.14.1). - HUMAN (HOMO SAPIENS).
CPD1_RAT    CYTOCHROME P450 IID1 (P450 DB1) (P450 CMF1A) (DEBRISOQUINE 4-HYDROXYLASE) (EC 1.14.14.1) (GENE NAME: CYP2D1). - RAT (RATTUS NORVEGICUS).
CPM1_PIG    CYTOCHROME P450 XIA1 (P450(SCC)) (EC 1.14.15.6), MITOCHONDRIAL (CHOLESTEROL SIDE-CHAIN CLEAVAGE ENZYME) (GENE NAME: CYP11A1). - PIG (SUS SCROFA).
RABCT450G   RABCT450G Rabbit cytochrome P-450Bc2 DNA, 5' flanking region. - Oryctolagus cuniculus Eukaryota
RATP45GMS   RATP45GMS Rat polymorphic, male-specific cytochrome P-450g mRNA, complete cds. - Rattus norvegicus Eukaryota
PAHDN3      PAHDN3 Plasmid pAH-delta-N3, junction area between the ADH1 promoter and cytochrome P-450 (pHP3) cDNA. - Artificial gene Artificial sequences
NRL_2CPP1   CYTOCHROME P450CAM (CAMPHOR MONOOXYGENASE) (E.C.1.14.15.1) WITH BOUND CAMPHOR - (PSEUDOMONAS PUTIDA)

You'll see that it picks up proteins with both `p450' and `p-450' in their title lines (fig 3). This is because it ignores the hyphen character. In other query languages you'd have had to use two queries to pick up all the proteins you wanted.

Another example of the effectiveness of free text searching would be `Cytochrome C'. There is obviously no problem in searching for `cytochrome' but the letter `C' presents some difficulty. If you gave other query languages this string they would certainly pick up any proteins whose title contained `cytochrome', but they would also accept a letter `C' anywhere in the title line, i.e. not necessarily immediately after the word `cytochrome'. Because DELPHOS ignores spaces the query can be formulated precisely using

DELPHOS> display title "cytochromec"

In the last example the `c' will always be in the right place!

A further consequence of free text searching is that you don't have to give an entire word; just a bit of it will do. An example would be

DELPHOS> display title "osai"

which will pick up, for example, all the entries with `mosaic' in the title line (fig 4).

Figure No.4

WORKLIST ENTRIES (13):
Y1_FMV      HYPOTHETICAL PROTEIN 1. - FIGWORT MOSAIC VIRUS (FMV).
Y16K_BGMV   HYPOTHETICAL 15.6 KD PROTEIN. - BEAN GOLDEN MOSAIC VIRUS.
Y30K_BGMV   POTENTIAL 29.7 KD PROTEIN (PUTATIVE INSECT TRANSMISSION PRODUCT). - BEAN GOLDEN MOSAIC VIRUS.
Y40K_BGMV   HYPOTHETICAL 40.2 KD PROTEIN. - BEAN GOLDEN MOSAIC VIRUS.
MBGBCG      MBGBCG Bean golden mosaic virus (BGMV), DNA B, complete sequence. - Bean golden mosaic virus Viridae
MBGBCG1     MBGBCG Bean golden mosaic virus (BGMV), DNA B, complete sequence. - Bean golden mosaic virus Viridae
MBGBCG2     MBGBCG Bean golden mosaic virus (BGMV), DNA B, complete sequence. - Bean golden mosaic virus Viridae
MBGBCG3     MBGBCG Bean golden mosaic virus (BGMV), DNA B, complete sequence. - Bean golden mosaic virus Viridae
MBGBCG5     MBGBCG Bean golden mosaic virus (BGMV), DNA B, complete sequence. - Bean golden mosaic virus Viridae
MTGMVS1     MTGMVS1 Tomato golden mosaic virus subgenomic DNA derived from DNA B cccds = covalently closed circular double-stranded molecule. - Tomato golden mosaic virus Viridae
JU0041      Hypothetical 14.5K protein - Chloris striate mosaic virus (CSMV)
JU0044      Hypothetical 15.8K protein - Chloris striate mosaic virus (CSMV)
JU0043      Hypothetical 33.2K protein - Chloris striate mosaic virus (CSMV)

Furthermore

DELPHOS> display title "iatemosaicvi"

will pick up all those entries containing `strIATE MOSAIC VIrus' (fig 5).

Figure No.5

WORKLIST ENTRIES (3):
JU0041      Hypothetical 14.5K protein - Chloris striate mosaic virus (CSMV)
JU0044      Hypothetical 15.8K protein - Chloris striate mosaic virus (CSMV)
JU0043      Hypothetical 33.2K protein - Chloris striate mosaic virus (CSMV)

2.4 An example text search

The DELPHOS `title' function restricts the search to the title (summary) lines; the `text' function doesn't. The scope of the `text' function is the ENTIRE title, bibliographic, comment and feature information within the database. Because DELPHOS is so fast, the speed of retrieval using `text' is virtually indistinguishable from that of `title' searches. Try the query

DELPHOS> display text "opsin"

(fig 6)

Figure No.6

WORKLIST ENTRIES (15):
OOBO        Rhodopsin - Bovine
OOFF        Rhodopsin - Fruit fly
OOFF2       Opsin 2 - Fruit fly
OOHUB       Blue-sensitive opsin - Human
OOHUR       Red-sensitive opsin - Human
OOHUG       Green-sensitive opsin - Human
QRHYB2      Beta-2-adrenergic receptor - Hamster
GBT1_BOVIN  GUANINE NUCLEOTIDE-BINDING PROTEIN G(T), ALPHA-1 SUBUNIT (TRANSDUCIN ALPHA-1 CHAIN). - BOVINE (BOS TAURUS).
GBT1_HUMAN  GUANINE NUCLEOTIDE-BINDING PROTEIN G(T), ALPHA-1 SUBUNIT (TRANSDUCIN ALPHA-1 CHAIN) (GENE NAME: GNAT1). - HUMAN (HOMO SAPIENS).
GBT2_BOVIN  GUANINE NUCLEOTIDE-BINDING PROTEIN G(T), ALPHA-2 SUBUNIT (TRANSDUCIN ALPHA-2 CHAIN). - BOVINE (BOS TAURUS).
OPS3_DROME  OPSIN RH3 (INNER R7 PHOTORECEPTOR CELLS OPSIN) (GENE NAME: RH3 OR RH92CD). - FRUIT FLY (DROSOPHILA MELANOGASTER).
OPS4_DROME  OPSIN RH4 (INNER R7 PHOTORECEPTOR CELLS OPSIN) (GENE NAME: RH4). - FRUIT FLY (DROSOPHILA MELANOGASTER).
OPSD_HUMAN  RHODOPSIN. - HUMAN (HOMO SAPIENS).
OPSD_MOUSE  RHODOPSIN. - MOUSE (MUS MUSCULUS).
OPSD_OCTDO  RHODOPSIN. - GIANT OCTOPUS (OCTOPUS DOFLEINI).

The output from this query is shown in figure 6. Again just the title lines of matching entries are shown. But wait! What is entry QRHYB2 doing there? The title line doesn't contain the word `opsin'! The answer of course is that `opsin' appears elsewhere in the text for that protein. We'll deal with all the functionality of the display command later, but for now, just to prove QRHYB2 does contain `opsin', type

DELPHOS> display/comment

(fig 7)

Figure No.7

WORKLIST ENTRIES (15):
OOBO Rhodopsin - Bovine Species: Bos primigenius taurus (cattle) Accession: A03154 Introns: 121/1, 177/2, 232/3, 312/3 Superfamily: vertebrate rhodopsin Keywords: photoreceptor; chromoprotein; glycoprotein; acetylation; transmembrane protein 1/Modified site: acetylated amino end 2,15/Binding site: carbohydrate (Asn) 296/Binding site: retinal chromophore OOFF Rhodopsin - Fruit fly Species: Drosophila melanogaster Accession: A22012 The domains were proposed from hydropathy indices. Some or all of the carboxyl-terminal Ser or Thr residues may be phosphorylated.
Map position: 3R66 (92B8-11) Gene name: ninaE Introns: 3/2, 190/2, 239/3, 332/2 Superfamily: vertebrate rhodopsin Keywords: photoreceptor; chromoprotein; transmembrane protein 1-49/Domain: extracellular I <EX1> 50-74/Domain: transmembrane I <TM1> 75-86/Domain: intracellular I <IN1> 87-109/Domain: transmembrane II <TM2> 110-127/Domain: extracellular II <EX2> 128-153/Domain: transmembrane III <TM3> 154-160/Domain: intracellular II <IN2> 161-181/Domain: transmembrane IV <TM4> 182-215/Domain: extracellular III <EX3> 216-243/Domain: transmembrane V <TM5> 244-276/Domain: intracellular III <IN3> 277-300/Domain: transmembrane VI <TM6> 301-308/Domain: extracellular IV <EX4> 309-332/Domain: transmembrane VII <TM7> 333-373/Domain: intracellular IV <IN4> OOFF2 Opsin 2 - Fruit fly Species: Drosophila melanogaster Accession: A24058 This protein is specifically expressed in photoreceptor cell R8 of the Drosophila compound eye. Map position: 3R (91D1-2) Gene name: Rh2 Introns: 33/3, 339/2, 350/3 Superfamily: vertebrate rhodopsin Keywords: photoreceptor; chromophore; transmembrane protein 326/Binding site: retinal chromophore (Lys) (by homology) 1-56/Domain: extracellular 1 (by homology) <EX1> 57-81/Domain: transmembrane 1 (by homology) <TM1> 82-93/Domain: intracellular 1 (by homology) <IN1> 94-116/Domain: transmembrane 2 (by homology) <TM2> 117-134/Domain: extracellular 2 (by homology) <EX2> 135-160/Domain: transmembrane 3 (by homology) <TM3> 161-167/Domain: intracellular 2 (by homology) <IN2> 168-188/Domain: transmembrane 4 (by homology) <TM4> 189-222/Domain: extracellular 3 (by homology) <EX3> 223-250/Domain: transmembrane 5 (by homology) <TM5> 251-283/Domain: intracellular 3 (by homology) <IN3> 284-307/Domain: transmembrane 6 (by homology) <TM6> 308-315/Domain: extracellular 4 (by homology) <EX4> 316-339/Domain: transmembrane 7 (by homology) <TM7> 340-381/Domain: intracellular 4 (by homology) <IN4> OOHUB Blue-sensitive opsin - Human Species: Homo sapiens (man) Accession: A03156 
The source of this protein is retinal cones. Map position: 7q22-qter Gene name: BCP Introns: 118/1, 174/2, 229/3, 309/3 Superfamily: vertebrate rhodopsin Keywords: color vision; membrane protein 34-57,71-94,113-136,149-172,198-221,250-273,283-306/Region: transmembrane segment (probable) OOHUR Red-sensitive opsin - Human Species: Homo sapiens (man) Accession: A03157 The source of this protein is retinal cones. Map position: Xq22-qter Gene name: RCP Introns: 38/1, 137/1, 193/2, 248/3, 328/3 Superfamily: vertebrate rhodopsin Keywords: color vision; sex-linked inheritance; membrane protein 53-76,90-113,132-155,168-191,217-240,269-292,302-325/Region: transmembrane segment (probable) OOHUG Green-sensitive opsin - Human Species: Homo sapiens (man) Accession: A03158 The source of this protein is retinal cones. Map position: Xq22-q28 Gene name: GCP Introns: 38/1, 137/1, 193/2, 248/3, 328/3 Superfamily: vertebrate rhodopsin Keywords: color vision; sex-linked inheritance; membrane protein 53-76,90-113,132-155,168-191,217-240,269-292,302-325/Region: transmembrane segment (probable) QRHYB2 Beta-2-adrenergic receptor - Hamster Species: Cricetinae gen. sp. (hamster) Accession: A03159 This protein may have up to seven hydrophobic membrane-spanning helices, as does rhodopsin, but the exact limits have not yet been determined. This protein was isolated from the lung. Superfamily: vertebrate rhodopsin Keywords: transmembrane protein; glycoprotein; receptor; lung; phosphoprotein; rhodopsin homolog 6,15/Binding site: carbohydrate (Asn) (putative) 261,262,345,346,347/Binding site: phosphate (putative) GBT1_BOVIN GUANINE NUCLEOTIDE-BINDING PROTEIN G(T), ALPHA-1 SUBUNIT ( TRANSDUCIN ALPHA-1 CHAIN). - BOVINE (BOS TAURUS). Species: BOVINE (BOS TAURUS). Accession: P04695 13-AUG-1987 (REL. 05, CREATED) 13-AUG-1987 (REL. 05, LAST SEQUENCE UPDATE) 01-NOV-1988 (REL. 
09, LAST ANNOTATION UPDATE) -!- FUNCTION: GUANINE NUCLEOTIDE-BINDING PROTEINS (G PROTEINS) ARE INVOLVED AS A MODULATOR OR TRANSDUCER IN VARIOUS TRANSMEMBRANE SIGNALING SYSTEMS. -!- FUNCTION: TRANSDUCIN IS AN AMPLIFIER AND ONE OF THE TRANSDUCERS OF A VISUAL IMPULSE THAT PERFORMS THE COUPLING BETWEEN RHODOPSIN AND CGMP-PHSOPHODIESTERASE. -!- SUBUNIT: G PROTEIN ARE COMPOSED OF 3 UNITS (ALPHA, BETA & GAMMA). THE BETA AND GAMMA UNITS APPEAR TO BE COMMON TO ALL G PROTEINS. -!- TRANSDUCIN ALPHA-1 CHAIN IS FOUND IN ROD. EMBL; K03253; BTTRA. EMBL; K03254; BTTRNAM. EMBL; X02440; BTTRDAR. BINDING 174 174 ADP-RIBOSE (BY ACTION OF CHOLERA TOXIN). BINDING 347 347 ADP-RIBOSE (BY ACTION OF IAP). NP_BIND 31 50 GTP (PROBABLE). NP_BIND 80 101 GTP (PROBABLE). NP_BIND 113 116 GTP (PROBABLE). NP_BIND 208 222 GTP (PROBABLE). NP_BIND 265 268 GTP (PROBABLE). Keywords: GTP-BINDING; TRANSDUCER; MULTIGENE FAMILY. GBT1_HUMAN GUANINE NUCLEOTIDE-BINDING PROTEIN G(T), ALPHA-1 SUBUNIT ( TRANSDUCIN ALPHA-1 CHAIN) (GENE NAME: GNAT1). - HUMAN (HOMO SAPIENS). Species: HUMAN (HOMO SAPIENS). Accession: P11488 01-OCT-1989 (REL. 12, CREATED) 01-OCT-1989 (REL. 12, LAST SEQUENCE UPDATE) 01-OCT-1989 (REL. 12, LAST ANNOTATION UPDATE) -!- FUNCTION: GUANINE NUCLEOTIDE-BINDING PROTEINS (G PROTEINS) ARE INVOLVED AS A MODULATOR OR TRANSDUCER IN VARIOUS TRANSMEMBRANE SIGNALING SYSTEMS. -!- FUNCTION: TRANSDUCIN IS AN AMPLIFIER AND ONE OF THE TRANSDUCERS OF A VISUAL IMPULSE THAT PERFORMS THE COUPLING BETWEEN RHODOPSIN AND CGMP-PHSOPHODIESTERASE. -!- SUBUNIT: G PROTEIN ARE COMPOSED OF 3 UNITS (ALPHA, BETA & GAMMA). THE BETA AND GAMMA UNITS APPEAR TO BE COMMON TO ALL G PROTEINS. -!- TRANSDUCIN ALPHA-1 CHAIN IS FOUND IN ROD. EMBL; X15088; HSGNAT1. BINDING 174 174 ADP-RIBOSE (BY ACTION OF CHOLERA TOXIN). BINDING 347 347 ADP-RIBOSE (BY ACTION OF IAP). NP_BIND 31 50 GTP (PROBABLE). NP_BIND 80 101 GTP (PROBABLE). NP_BIND 113 116 GTP (PROBABLE). NP_BIND 208 222 GTP (PROBABLE). NP_BIND 265 268 GTP (PROBABLE). 
Keywords: GTP-BINDING; TRANSDUCER; MULTIGENE FAMILY. GBT2_BOVIN GUANINE NUCLEOTIDE-BINDING PROTEIN G(T), ALPHA-2 SUBUNIT ( TRANSDUCIN ALPHA-2 CHAIN). - BOVINE (BOS TAURUS). Species: BOVINE (BOS TAURUS). Accession: P04696 13-AUG-1987 (REL. 05, CREATED) 13-AUG-1987 (REL. 05, LAST SEQUENCE UPDATE) 01-NOV-1988 (REL. 09, LAST ANNOTATION UPDATE) -!- FUNCTION: GUANINE NUCLEOTIDE-BINDING PROTEINS (G PROTEINS) ARE INVOLVED AS A MODULATOR OR TRANSDUCER IN VARIOUS TRANSMEMBRANE SIGNALING SYSTEMS -!- FUNCTION: TRANSDUCIN IS AN AMPLIFIER AND ONE OF THE TRANSDUCERS OF A VISUAL IMPULSE THAT PERFORMS THE COUPLING BETWEEN RHODOPSIN AND CGMP-PHSOPHODIESTERASE. -!- SUBUNIT: G PROTEIN ARE COMPOSED OF 3 UNITS (ALPHA, BETA & GAMMA). THE BETA AND GAMMA UNITS APPEAR TO BE COMMON TO ALL G PROTEINS. -!- TRANSDUCIN ALPHA-2 CHAIN IS FOUND IN OUTER SEGMENTS. EMBL; M11116; BTNA2. MOD_RES 2 2 ACETYLATION (BY HOMOLOGY WITH RAS). BINDING 178 178 ADP-RIBOSE (BY ACTION OF CHOLERA TOXIN). BINDING 351 351 ADP-RIBOSE (BY ACTION OF IAP). NP_BIND 260 276 GTP (PROBABLE). Keywords: GTP-BINDING; TRANSDUCER; MULTIGENE FAMILY. OPS3_DROME OPSIN RH3 (INNER R7 PHOTORECEPTOR CELLS OPSIN) (GENE NAME: RH3 OR RH92CD). - FRUIT FLY (DROSOPHILA MELANOGASTER). Species: FRUIT FLY (DROSOPHILA MELANOGASTER). Accession: P04950 13-AUG-1987 (REL. 05, CREATED) 13-AUG-1987 (REL. 05, LAST SEQUENCE UPDATE) 01-JAN-1990 (REL. 13, LAST ANNOTATION UPDATE) -!- FUNCTION: VISUAL PIGMENTS ARE THE LIGHT-ABSORBING MOLECULES THAT MEDIATE VISION. THEY CONSIST OF AN APOPROTEIN, OPSIN, COVALENTLY LINKED TO CIS-RETINAL. -!- EACH DROSOPHILA EYE IS COMPOSED OF 800 FACETS OR OMMATIDIA. EACH OMMATIDIUM CONTAINS 8 PHOTORECEPTOR CELLS (R1-R8), THE R1 TO R6 CELLS ARE OUTER CELLS, WHILE R7 AND R8 ARE INNER CELLS. -!- OPSIN RH3 IS SENSITIVE TO UV LIGHT. -!- SOME OR ALL OF THE CARBOXYL-TERMINAL SER OR THR RESIDUES MAY BE PHOSPHORYLATED. -!- SIMILARITY: TO ALL OTHER G-PROTEIN COUPLED RECEPTORS. EMBL; Y00043; DMRH92CD. EMBL; M17718; DMRH3A. 
PROSITE; PS00237; G_PROTEIN_RECEPTOR. PROSITE; PS00238; OPSIN. CARBOHYD 13 13 PROBABLE. BINDING 328 328 RETINAL CHROMOPHORE. DOMAIN 1 62 EXTRACELLULAR. TRANSMEM 63 83 DOMAIN 84 95 CYTOPLASMIC. TRANSMEM 96 115 DOMAIN 116 130 EXTRACELLULAR. TRANSMEM 131 151 DOMAIN 152 171 CYTOPLASMIC. TRANSMEM 172 192 DOMAIN 193 219 EXTRACELLULAR. TRANSMEM 220 240 DOMAIN 241 288 CYTOPLASMIC. TRANSMEM 289 309 DOMAIN 310 319 EXTRACELLULAR. TRANSMEM 320 340 DOMAIN 341 383 CYTOPLASMIC. Keywords: PHOTORECEPTOR; RETINAL PROTEIN; TRANSMEMBRANE; PHOSPHORYLATION; GLYCOPROTEIN; G-PROTEIN COUPLED RECEPTOR; VISION. OPS4_DROME OPSIN RH4 (INNER R7 PHOTORECEPTOR CELLS OPSIN) (GENE NAME: RH4). - FRUIT FLY (DROSOPHILA MELANOGASTER). Species: FRUIT FLY (DROSOPHILA MELANOGASTER). Accession: P08255 01-AUG-1988 (REL. 08, CREATED) 01-AUG-1988 (REL. 08, LAST SEQUENCE UPDATE) 01-JAN-1990 (REL. 13, LAST ANNOTATION UPDATE) -!- FUNCTION: VISUAL PIGMENTS ARE THE LIGHT-ABSORBING MOLECULES THAT MEDIATE VISION. THEY CONSIST OF AN APOPROTEIN, OPSIN, COVALENTLY LINKED TO CIS-RETINAL. -!- EACH DROSOPHILA EYE IS COMPOSED OF 800 FACETS OR OMMATIDIA. EACH OMMATIDIUM CONTAINS 8 PHOTORECEPTOR CELLS (R1-R8), THE R1 TO R6 CELLS ARE OUTER CELLS, WHILE R7 AND R8 ARE INNER CELLS. -!- OPSIN RH4 IS SENSITIVE TO UV LIGHT. -!- SOME OR ALL OF THE CARBOXYL-TERMINAL SER OR THR RESIDUES MAY BE PHOSPHORYLATED. -!- SIMILARITY: TO ALL OTHER G-PROTEIN COUPLED RECEPTORS. EMBL; M17719; DMRH4A1. EMBL; M17730; DMRH4A2. PROSITE; PS00237; G_PROTEIN_RECEPTOR. PROSITE; PS00238; OPSIN. CARBOHYD 6 6 PROBABLE. BINDING 324 324 RETINAL CHROMOPHORE. DOMAIN 1 58 EXTRACELLULAR. TRANSMEM 59 79 DOMAIN 80 91 CYTOPLASMIC. TRANSMEM 92 111 DOMAIN 112 126 EXTRACELLULAR. TRANSMEM 127 147 DOMAIN 148 167 CYTOPLASMIC. TRANSMEM 168 188 DOMAIN 189 215 EXTRACELLULAR. TRANSMEM 216 236 DOMAIN 237 284 CYTOPLASMIC. TRANSMEM 285 305 DOMAIN 306 315 EXTRACELLULAR. TRANSMEM 316 336 DOMAIN 337 378 CYTOPLASMIC. 
Keywords: PHOTORECEPTOR; RETINAL PROTEIN; TRANSMEMBRANE; PHOSPHORYLATION; GLYCOPROTEIN; G-PROTEIN COUPLED RECEPTOR; VISION. OPSD_HUMAN RHODOPSIN. - HUMAN (HOMO SAPIENS). Species: HUMAN (HOMO SAPIENS). Accession: P08100 01-AUG-1988 (REL. 08, CREATED) 01-AUG-1988 (REL. 08, LAST SEQUENCE UPDATE) 01-JAN-1990 (REL. 13, LAST ANNOTATION UPDATE) -!- FUNCTION: VISUAL PIGMENTS ARE THE LIGHT-ABSORBING MOLECULES THAT MEDIATE VISION. THEY CONSIST OF AN APOPROTEIN, OPSIN, COVALENTLY LINKED TO CIS-RETINAL. -!- RHODOPSIN IS FOUND IN ROD SHAPED PHOTORECEPTOR CELLS WHICH MEDIATES VISION IN DIM LIGHT. -!- RHODOPSIN HAS AN ABSORPTION MAXIMA AT 495 NM. -!- SOME OR ALL OF THE CARBOXYL-TERMINAL SER OR THR RESIDUES MAY BE PHOSPHORYLATED. -!- SIMILARITY: TO ALL OTHER G-PROTEIN COUPLED RECEPTORS. EMBL; K02281; HSOPS. PROSITE; PS00237; G_PROTEIN_RECEPTOR. PROSITE; PS00238; OPSIN. MOD_RES 1 1 ACETYLATION (BY HOMOLOGY). CARBOHYD 2 2 BY HOMOLOGY. CARBOHYD 15 15 BY HOMOLOGY. BINDING 296 296 RETINAL CHROMOPHORE. BINDING 322 322 PALMITYL (BY HOMOLOGY). BINDING 323 323 PALMITYL (BY HOMOLOGY). DOMAIN 1 36 EXTRACELLULAR. TRANSMEM 37 61 DOMAIN 62 73 CYTOPLASMIC. TRANSMEM 74 98 DOMAIN 99 113 EXTRACELLULAR. TRANSMEM 114 140 DOMAIN 141 152 CYTOPLASMIC. TRANSMEM 153 176 DOMAIN 173 202 EXTRACELLULAR. TRANSMEM 203 230 DOMAIN 231 252 CYTOPLASMIC. TRANSMEM 253 276 DOMAIN 277 284 EXTRACELLULAR. TRANSMEM 285 309 DOMAIN 310 348 CYTOPLASMIC. Keywords: PHOTORECEPTOR; RETINAL PROTEIN; TRANSMEMBRANE; GLYCOPROTEIN; VISION; PHOSPHORYLATION; LIPOPROTEIN; ACETYLATION; G-PROTEIN COUPLED RECEPTOR. OPSD_MOUSE RHODOPSIN. - MOUSE (MUS MUSCULUS). Species: MOUSE (MUS MUSCULUS). Accession: P15409 01-APR-1990 (REL. 14, CREATED) 01-APR-1990 (REL. 14, LAST SEQUENCE UPDATE) 01-APR-1990 (REL. 14, LAST ANNOTATION UPDATE) -!- FUNCTION: VISUAL PIGMENTS ARE THE LIGHT-ABSORBING MOLECULES THAT MEDIATE VISION. THEY CONSIST OF AN APOPROTEIN, OPSIN, COVALENTLY LINKED TO CIS-RETINAL. 
-!- RHODOPSIN IS FOUND IN ROD SHAPED PHOTORECEPTOR CELLS WHICH MEDIATES VISION IN DIM LIGHT. -!- RHODOPSIN HAS AN ABSORPTION MAXIMA AT 495 NM. -!- SOME OR ALL OF THE CARBOXYL-TERMINAL SER OR THR RESIDUES MAY BE PHOSPHORYLATED. -!- SIMILARITY: TO ALL OTHER G-PROTEIN COUPLED RECEPTORS. PIR; S01656; S01656. PROSITE; PS00237; G_PROTEIN_RECEPTOR. PROSITE; PS00238; OPSIN. CARBOHYD 2 2 BY HOMOLOGY. CARBOHYD 15 15 BY HOMOLOGY. BINDING 296 296 RETINAL CHROMOPHORE. BINDING 322 322 PALMITYL (BY HOMOLOGY). BINDING 323 323 PALMITYL (BY HOMOLOGY). DOMAIN 1 36 EXTRACELLULAR. TRANSMEM 37 61 DOMAIN 62 73 CYTOPLASMIC. TRANSMEM 74 98 DOMAIN 99 113 EXTRACELLULAR. TRANSMEM 114 140 DOMAIN 141 152 CYTOPLASMIC. TRANSMEM 153 176 DOMAIN 173 202 EXTRACELLULAR. TRANSMEM 203 230 DOMAIN 231 252 CYTOPLASMIC. TRANSMEM 253 276 DOMAIN 277 284 EXTRACELLULAR. TRANSMEM 285 309 DOMAIN 310 348 CYTOPLASMIC. OPSD_OCTDO RHODOPSIN. - GIANT OCTOPUS (OCTOPUS DOFLEINI). Species: GIANT OCTOPUS (OCTOPUS DOFLEINI). Accession: P09241 01-MAR-1989 (REL. 10, CREATED) 01-MAR-1989 (REL. 10, LAST SEQUENCE UPDATE) 01-JAN-1990 (REL. 13, LAST ANNOTATION UPDATE) -!- FUNCTION: VISUAL PIGMENTS ARE THE LIGHT-ABSORBING MOLECULES THAT MEDIATE VISION. THEY CONSIST OF AN APOPROTEIN, OPSIN, COVALENTLY LINKED TO CIS-RETINAL. -!- RHODOPSIN IS FOUND IN ROD SHAPED PHOTORECEPTOR CELLS WHICH MEDIATES VISION IN DIM LIGHT. -!- RHODOPSIN HAS AN ABSORPTION MAXIMA AT 495 NM. -!- SOME OR ALL OF THE CARBOXYL-TERMINAL SER OR THR RESIDUES MAY BE PHOSPHORYLATED. -!- SIMILARITY: TO ALL OTHER G-PROTEIN COUPLED RECEPTORS. EMBL; X07797; PDRHOD. PROSITE; PS00237; G_PROTEIN_RECEPTOR. PROSITE; PS00238; OPSIN. CARBOHYD 9 9 PROBABLE. CARBOHYD 15 15 PROBABLE. BINDING 306 306 RETINAL CHROMOPHORE. BINDING 337 337 PALMITYL (BY HOMOLOGY). BINDING 338 338 PALMITYL (BY HOMOLOGY). DOMAIN 1 36 EXTRACELLULAR. TRANSMEM 37 61 DOMAIN 62 73 CYTOPLASMIC. TRANSMEM 74 98 DOMAIN 99 107 EXTRACELLULAR. TRANSMEM 108 131 DOMAIN 132 152 CYTOPLASMIC. 
TRANSMEM 153 176 DOMAIN 173 200 EXTRACELLULAR. TRANSMEM 201 224 DOMAIN 225 262 CYTOPLASMIC. TRANSMEM 263 287 DOMAIN 288 299 EXTRACELLULAR. TRANSMEM 300 323 DOMAIN 324 455 CYTOPLASMIC. Keywords: PHOTORECEPTOR; RETINAL PROTEIN; TRANSMEMBRANE; GLYCOPROTEIN; VISION; PHOSPHORYLATION; LIPOPROTEIN; G-PROTEIN COUPLED RECEPTOR.

This command will show all comment information for every entry in the worklist as well as the title lines. Look in the comment information for entry QRHYB2 and you'll find `opsin' as part of the word `rhodopsin' (figure 7).

2.5 An example fuzzy sequence search

Sometimes you'll not want to look for exact sequence matches. Instead, you may want some latitude in the search. A fuzzy sequence match allows you to specify mismatches in the query sequence; the DELPHOS `fseq' function is used in these cases. Type

DELPHOS> display/info fseq "(c)aq(ch) 1"

(fig 8)

Figure No.8

Matches for FSEQ probe (C)AQ(CH) are:
CCHU        14     GDVEKGKKIFIMK CSQCH TVEKGGKHKTGPNLHG  Cytochrome c - Human
CCOS        14     GDIEKGKKIFVQK CSQCH TVEKGGKHKTGPNLDG  Cytochrome c - Ostrich
CCSF        14     GQVEKGKKIFVQR CAQCH TVEKAGKHKTGPNLNG  Cytochrome c - Common European starfish
CCAB        22  APPGBAKAGEKIFKTK CAQCH TVEKGAGHKQGPNLNG  Cytochrome c - Chingma mallow
CCBF6       14     ADIENGERIFTAN CAACH AGGNNVIMPEKTLKKD  Cytochrome c6 - Bumilleriopsis filiformis
RDC8_CANFA  71  VGVLAIPFAITISTGF CAACH NCLFFACFVLVLTQSS  PROBABLE G PROTEIN-COUPLED RECEPTOR RDC8 (GENE NAME: RDC8). -
CYTO84      22  APPGNPKAGEKIFKTK CAQCH TVEKGAGHKQGPNLNG  CYTOCHROME C TOMATO - TOMATO (LYCOPERSICON ESCULENTUM)
NRL_1CYC1   14     GDVAKGKKTFVQK CAQCH TVENGGKHKVGPNLWG  FERROCYTOCHROME C - BONITO (KATSUWONUS PELAMIS, LINNAEUS)
NRL_3CYT1   14     GDVAKGKKTFVQK CAQCH TVENGGKHKVGPNLWG  CYTOCHROME C (OXIDIZED) - ALBACORE TUNA (THUNNUS ALALUNGA) HEA

WORKLIST ENTRIES (9):
CCHU        Cytochrome c - Human
CCOS        Cytochrome c - Ostrich
CCSF        Cytochrome c - Common European starfish
CCAB        Cytochrome c - Chingma mallow
CCBF6       Cytochrome c6 - Bumilleriopsis filiformis
RDC8_CANFA  PROBABLE G PROTEIN-COUPLED RECEPTOR RDC8 (GENE NAME: RDC8). - DOG (CANIS FAMILIARIS).
CYTO84      CYTOCHROME C TOMATO - TOMATO (LYCOPERSICON ESCULENTUM)
NRL_1CYC1   FERROCYTOCHROME C - BONITO (KATSUWONUS PELAMIS, LINNAEUS)
NRL_3CYT1   CYTOCHROME C (OXIDIZED) - ALBACORE TUNA (THUNNUS ALALUNGA) HEART

The `fseq' parameter consists of two parts. The first part specifies the search sequence in single letter amino acid codes with optional parentheses. Mismatches are only allowed for those letters which are not enclosed by parentheses. The second part is a non-negative integer which specifies the maximum number of allowed mismatches. The query above therefore gets all entries which contain

cys-ala-anything-cys-his

or

cys-anything-gln-cys-his

This is because 1 mismatch is allowed and there are only 2 residues where it may occur (fig 8).

Fseq also allows you to specify that you don't mind what a particular amino acid is; this is done using the letter `x'. Type

DELPHOS> display/info fseq "(c)xx(ch) 0"

(fig 9)

Figure No.9

Matches for FSEQ probe (C)XX(CH) are:
CCHU        14     GDVEKGKKIFIMK CSQCH TVEKGGKHKTGPNLHG  Cytochrome c - Human
CCOS        14     GDIEKGKKIFVQK CSQCH TVEKGGKHKTGPNLDG  Cytochrome c - Ostrich
CCSF        14     GQVEKGKKIFVQR CAQCH TVEKAGKHKTGPNLNG  Cytochrome c - Common European starfish
CCAB        22  APPGBAKAGEKIFKTK CAQCH TVEKGAGHKQGPNLNG  Cytochrome c - Chingma mallow
CCNA5A      13      GDVEAGKAAFNK CKACH EIGESAKNKVGPELDG  Cytochrome c550 - Nitrobacter winogradskyi
CCBF6       14     ADIENGERIFTAN CAACH AGGNNVIMPEKTLKKD  Cytochrome c6 - Bumilleriopsis filiformis
CCDS7       26  KGNVTFDHKAHAEKLG CDACH EGTPAKIAIDKKSAHK  Cytochrome c7 (c551.5) - Desulfuromonas acetoxidans
CCDS7       49  TPAKIAIDKKSAHKDA CKTCH KSNNGPTKCGGCHIK   Cytochrome c7 (c551.5) - Desulfuromonas acetoxidans
CCDS7       62  KDACKTCHKSNNGPTK CGGCH IK                Cytochrome c7 (c551.5) - Desulfuromonas acetoxidans
CCRFCX     117  GEASAFGPALKKLGGT CKACH DDYRAEH           Cytochrome c' - Rhodopseudomonas sp.
C553_DESVH  34  LAVSGVAADGAALYKS CIGCH GADGSKAAMGSAKPVK  CYTOCHROME C553 PRECURSOR. - DESULFOVIBRIO VULGARIS (STRAIN HI
CYCR_RHOVI 107  LRTMTAITEWVSPQEG CTYCH DENNLASEAKYPYVVA  CYTOCHROME C SUBUNIT OF THE PHOTOSYNTHETIC REACTION CENTER PRE
CYCR_RHOVI 152  AINTNWTQHVAQTGVT CYTCH RGTPLPPYVRYLEPTL  CYTOCHROME C SUBUNIT OF THE PHOTOSYNTHETIC REACTION CENTER PRE
CYCR_RHOVI 264  ATFALMMSISDSLGTN CTFCH NAQTFESWGKKSTPQR  CYTOCHROME C SUBUNIT OF THE PHOTOSYNTHETIC REACTION CENTER PRE
CYCR_RHOVI 325  LPASRLGRQGEAPQAD CRTCH QGVTKPLFGASRLKDY  CYTOCHROME C SUBUNIT OF THE PHOTOSYNTHETIC REACTION CENTER PRE
RDC8_CANFA  71  VGVLAIPFAITISTGF CAACH NCLFFACFVLVLTQSS  PROBABLE G PROTEIN-COUPLED RECEPTOR RDC8 (GENE NAME: RDC8). -
PDECYT550   35  AAQDGDAAKGEKEFNK CKACH MIQAPDGTDIIKGGKT  PDECYT550 cytochrome c550 precursor - Paracoccus denitrificans
PALMT13     14     RKVHAKGASLFFI CMYCH IGRGLYYG          PALMT13 cytochrome b (AA at 1) - Mitochondrion Paracentrotus l
CYTO84      22  APPGNPKAGEKIFKTK CAQCH TVEKGAGHKQGPNLNG  CYTOCHROME C TOMATO - TOMATO (LYCOPERSICON ESCULENTUM)
NRL_155C1   15    NEGDAAKGEKEFNK CKACH MIQAPDGTDIKGGKTG  CYTOCHROME C550 - (PARACOCCUS DENITRIFICANS) ATCC 13543
NRL_1CYC1   14     GDVAKGKKTFVQK CAQCH TVENGGKHKVGPNLWG  FERROCYTOCHROME C - BONITO (KATSUWONUS PELAMIS, LINNAEUS)
NRL_2C2C1   14     EGDAAAGEKVSKK CLACH TFDQGGANKVGPNLFG  CYTOCHROME C2 (OXIDIZED) - (RHODOSPIRILLUM RUBRUM)
NRL_2CCY1  117  AGPDALKAQAAATGKV CKACH EEFKQD            CYTOCHROME C' - (RHODOSPIRILLUM MOLISCHIANUM)
NRL_2CDV1   30  TKQPVVFNHSTHKAVK CGDCH HPVNGKENYQKCATAG  CYTOCHROME C3 - (DESULFOVIBRIO VULGARIS MIYAZAKI IAM 12604)
NRL_2CDV1   79  KGYYHAMHDKGTKFKS CVGCH LETAGADAAKKKELTG  CYTOCHROME C3 - (DESULFOVIBRIO VULGARIS MIYAZAKI IAM 12604)
NRL_351C1   12       EDPEVLFKNKG CVACH AIDTKMVGPAYKDVAA  CYTOCHROME C551 (OXIDIZED) - (PSEUDOMONAS AERUGINOSA)
NRL_3CYT1   14     GDVAKGKKTFVQK CAQCH TVENGGKHKVGPNLWG  CYTOCHROME C (OXIDIZED) - ALBACORE TUNA (THUNNUS ALALUNGA) HEA

WORKLIST ENTRIES (21):
CCHU        Cytochrome c - Human
CCOS        Cytochrome c - Ostrich
CCSF        Cytochrome c - Common European starfish
CCAB        Cytochrome c - Chingma mallow
CCNA5A      Cytochrome c550 - Nitrobacter
winogradskyi CCBF6 Cytochrome c6 - Bumilleriopsis filiformis CCDS7 Cytochrome c7 (c551.5) - Desulfuromonas acetoxidans CCRFCX Cytochrome c' - Rhodopseudomonas sp. C553_DESVH CYTOCHROME C553 PRECURSOR. - DESULFOVIBRIO VULGARIS (STRAIN HILDENBOROUGH). CYCR_RHOVI CYTOCHROME C SUBUNIT OF THE PHOTOSYNTHETIC REACTION CENTER PRECURSOR (C558/C559). - RHODOPSEUDOMONAS VIRIDIS. RDC8_CANFA PROBABLE G PROTEIN-COUPLED RECEPTOR RDC8 (GENE NAME: RDC8). - DOG (CANIS FAMILIARIS). PDECYT550 PDECYT550 P.denitrificans cytochrome c550 gene, complete cds, and iso-cytochrome oxidase subunit I (iso-COI) gene, 5' end. - Paracoccus denitrificans Prokaryota PALMT13 PALMT13 P.lividus mitochondrial (Bam2 B fragment) cytochrome b, partial cds. - Mitochondrion Paracentrotus lividus Eukaryota CYTO84 CYTOCHROME C TOMATO - TOMATO (LYCOPERSICON ESCULENTUM) NRL_155C1 CYTOCHROME C550 - (PARACOCCUS DENITRIFICANS) ATCC 13543 NRL_1CYC1 FERROCYTOCHROME C - BONITO (KATSUWONUS PELAMIS, LINNAEUS) NRL_2C2C1 CYTOCHROME C2 (OXIDIZED) - (RHODOSPIRILLUM RUBRUM) NRL_2CCY1 CYTOCHROME C' - (RHODOSPIRILLUM MOLISCHIANUM) NRL_2CDV1 CYTOCHROME C3 - (DESULFOVIBRIO VULGARIS MIYAZAKI IAM 12604) NRL_351C1 CYTOCHROME C551 (OXIDIZED) - (PSEUDOMONAS AERUGINOSA) NRL_3CYT1 CYTOCHROME C (OXIDIZED) - ALBACORE TUNA (THUNNUS ALALUNGA) HEART 2.6 Looking at one protein and introducing DISPLAY flexibility You can look at any protein you want in the database by using the `code' function. Type DELPHOS> display code "oobo" (fig 10) Figure No.10 WORKLIST ENTRIES (1): OOBO Rhodopsin - Bovine This will put just one entry into the worklist, that entry with the pcode `oobo' namely bovine rhodopsin. You'll just have got the title line again. Obviously a database entry contains more information than the title line, the sequence for example! 
To display the sequence of this entry, which is now in the worklist type DELPHOS> display/sequence (fig 11) Figure No.11 WORKLIST ENTRIES (1): OOBO Rhodopsin - Bovine Ala A 29 Cys C 10 Asp D 5 Glu E 17 Phe F 31 Gly G 23 His H 6 Ile I 22 Lys K 11 Leu L 28 Met M 16 Asn N 15 Pro P 20 Gln Q 12 Arg R 7 Ser S 15 Thr T 27 Val V 31 Trp W 5 Tyr Y 18 Mol. wt. (calc) = 38962 Residues = 348 1 M N G T E G P N F Y V P F S N K T G V V R S P F E A P Q Y Y 31 L A E P W Q F S M L A A Y M F L L I M L G F P I N F L T L Y 61 V T V Q H K K L R T P L N Y I L L N L A V A D L F M V F G G 91 F T T T L Y T S L H G Y F V F G P T G C N L E G F F A T L G 121 G E I A L W S L V V L A I E R Y V V V C K P M S N F R F G E 151 N H A I M G V A F T W V M A L A C A A P P L V G W S R Y I P 181 E G M Q C S C G I D Y Y T P H E E T N N E S F V I Y M F V V 211 H F I I P L I V I F F C Y G Q L V F T V K E A A A Q Q Q E S 241 A T T Q K A E K E V T R M V I I M V I A F L I C W L P Y A G 271 V A F Y I F T H Q G S D F G P I F M T I P A F F A K T S A V 301 Y N P V I Y I M M N K Q F R N C M V T T L C C G K N P L G D 331 D E A S T T V S K T E T S Q V A P A Similarly, to display authors and papers, alternative names and comment information try typing DELPHOS> display/author (fig 12) Figure No.12 WORKLIST ENTRIES (1): OOBO Rhodopsin - Bovine Nathans, J., and Hogness, D.S.Cell 34, 807-814, 1983 (Sequence translated from the DNA sequence) Ovchinnikov, Y.A.FEBS Lett. 148, 179-191, 1982 (Complete sequence) Koike, S., Nabeshima, Y., Ogata, K., Fukui, T., Ohtsuka, E., Ikehara, M., and Tokunaga, F.Biochem. Biophys. Res. Commun. 116, 563-567, 1983 (Sequence of residues 205-348 translated from the mRNA sequence) This sequence differs from that shown in having 213-Val. Hargrave, P.A.submitted to the Protein Sequence Database, June 1984 (Carbohydrate binding sites) Mullen, E., and Akhtar, M.Biochem. J. 
211, 45-54, 1983 (Retinal binding site) DELPHOS> display/alternative (oobo has no alternative names) (fig 13) Figure No.13 WORKLIST ENTRIES (1): OOBO Rhodopsin - Bovine DELPHOS> display/comment (fig 14) Figure No.14 WORKLIST ENTRIES (1): OOBO Rhodopsin - Bovine Species: Bos primigenius taurus (cattle) Accession: A03154 Introns: 121/1, 177/2, 232/3, 312/3 Superfamily: vertebrate rhodopsin Keywords: photoreceptor; chromoprotein; glycoprotein; acetylation; transmembrane protein 1/Modified site: acetylated amino end 2,15/Binding site: carbohydrate (Asn) 296/Binding site: retinal chromophore The commands above list each type of information separately. If you want the whole lot in one go type DELPHOS> display/full (fig 15) Figure No.15 WORKLIST ENTRIES (1): OOBO Rhodopsin - Bovine Species: Bos primigenius taurus (cattle) Accession: A03154 Nathans, J., and Hogness, D.S.Cell 34, 807-814, 1983 (Sequence translated from the DNA sequence) Ovchinnikov, Y.A.FEBS Lett. 148, 179-191, 1982 (Complete sequence) Koike, S., Nabeshima, Y., Ogata, K., Fukui, T., Ohtsuka, E., Ikehara, M., and Tokunaga, F.Biochem. Biophys. Res. Commun. 116, 563-567, 1983 (Sequence of residues 205-348 translated from the mRNA sequence) This sequence differs from that shown in having 213-Val. Hargrave, P.A.submitted to the Protein Sequence Database, June 1984 (Carbohydrate binding sites) Mullen, E., and Akhtar, M.Biochem. J. 211, 45-54, 1983 (Retinal binding site) Introns: 121/1, 177/2, 232/3, 312/3 Superfamily: vertebrate rhodopsin Keywords: photoreceptor; chromoprotein; glycoprotein; acetylation; transmembrane protein 1/Modified site: acetylated amino end 2,15/Binding site: carbohydrate (Asn) 296/Binding site: retinal chromophore Ala A 29 Cys C 10 Asp D 5 Glu E 17 Phe F 31 Gly G 23 His H 6 Ile I 22 Lys K 11 Leu L 28 Met M 16 Asn N 15 Pro P 20 Gln Q 12 Arg R 7 Ser S 15 Thr T 27 Val V 31 Trp W 5 Tyr Y 18 Mol. wt. 
(calc) = 38962 Residues = 348 1 M N G T E G P N F Y V P F S N K T G V V R S P F E A P Q Y Y 31 L A E P W Q F S M L A A Y M F L L I M L G F P I N F L T L Y 61 V T V Q H K K L R T P L N Y I L L N L A V A D L F M V F G G 91 F T T T L Y T S L H G Y F V F G P T G C N L E G F F A T L G 121 G E I A L W S L V V L A I E R Y V V V C K P M S N F R F G E 151 N H A I M G V A F T W V M A L A C A A P P L V G W S R Y I P 181 E G M Q C S C G I D Y Y T P H E E T N N E S F V I Y M F V V 211 H F I I P L I V I F F C Y G Q L V F T V K E A A A Q Q Q E S 241 A T T Q K A E K E V T R M V I I M V I A F L I C W L P Y A G 271 V A F Y I F T H Q G S D F G P I F M T I P A F F A K T S A V 301 Y N P V I Y I M M N K Q F R N C M V T T L C C G K N P L G D 331 D E A S T T V S K T E T S Q V A P A you'll get all the information (authors, alternative names, comments and the sequence) available in the database for this protein. Two other qualifiers to the display command are `output' and `printer'. The printer option sends everything that appears on the screen to the printer associated with your computer enabling you to keep a permanent record on paper of the results of your query. Some sites do not have a direct computer-printer connection; in this case the `output' qualifier is the most useful. The `output' qualifier sends everything that appears on the screen to a specified disc file as well. This file can then be sent to any printer you wish. We recommend the use of `output' rather than `printer'. Try typing DELPHOS> display/output=oobo.title Everything appears as before, the title line for OOBO is shown, but if you now leave DELPHOS by typing DELPHOS> quit you'll find a file called `oobo.title' in your directory. If you're on a VMS system you can look at the file by typing $ type oobo.title or, if you're on a UNIX machine type % cat oobo.title In lots of cases the title line information is not enough, you'll want full information. DELPHOS allows any combination of qualifiers to the display command. 
Reenter DELPHOS and type

    DELPHOS> display code "oobo"
    DELPHOS> display/full/output=oobo.full

The resulting file `oobo.full' will contain all available database information on bovine rhodopsin. Leave DELPHOS again, examine the file, then reenter DELPHOS.

The beauty of the display qualifiers is that they can be used with any display command. The file `oobo.full' was created using two steps in the last example but it could have been created in one step by typing

    DELPHOS> display/full/output=oobo.full code "oobo"

The use of the display qualifiers is not restricted to the `code' function; they can be used with `seq', `title', `text' and `fseq' as well. For example, try

    DELPHOS> display/sequence seq "vpfsn"
    DELPHOS> display/author text "hogness"
    DELPHOS> display/comment/author/output=opsin.dat title "opsin"

The one restriction on the use of the display qualifiers is that the `output' and `printer' qualifiers are mutually exclusive.

One display qualifier not yet mentioned is `info'. This qualifier is rather special and is described in greater detail later. However, one of its properties is that it displays context information with the `seq' function. Try typing

    DELPHOS> display seq "vpfsn"

and then

    DELPHOS> display/info seq "vpfsn" (fig 16)

Figure No.16

    Matches for SEQ probe VPFSN are:
    No. of matches = 4
    OOBO        11   MNGTEGPNFY VPFSN KTGVVRSPFEAPQYYL
    COX2_PARLI  214  EICGANHSFMPILIES VPFSN FENWVAQYIEE
    OPSD_HUMAN  11   MNGTEGPNFY VPFSN ATGVVRSPFEYPQYYL
    OPSD_MOUSE  11   MNGTEGPNFY VPFSN VTGVGRSPFEQPQYYL

    WORKLIST ENTRIES (4):
    OOBO        Rhodopsin - Bovine
    COX2_PARLI  CYTOCHROME C OXIDASE POLYPEPTIDE II (EC 1.9.3.1) (GENE NAME: COII) . - SEA URCHIN (PARACENTROTUS LIVIDUS).
    OPSD_HUMAN  RHODOPSIN. - HUMAN (HOMO SAPIENS).
    OPSD_MOUSE  RHODOPSIN. - MOUSE (MUS MUSCULUS).

The matching bits of sequence are only displayed if you use the `info' qualifier.

2.7 Introducing NOT

There are some circumstances where you'll want to find all the database entries which DO NOT have a particular characteristic.
In these circumstances you use the `not' function. This function can be put before any of the other functions, i.e. `seq', `title', `code', `text' and `fseq'. Try typing

    DELPHOS> display not title "opsin"

What you'll get is all the database entries which DON'T contain the word `opsin' in their title lines, 489 of them! The `not' function is particularly useful in the complex queries described later.

2.8 Multiple parameters

Up to now the tutorial has only shown simple queries, those which have only one parameter. Keeping with our opsin examples, let's assume you're only interested in one particular opsin, bovine rhodopsin. Further, let's assume you don't know the pcode of bovine rhodopsin in the database. The two queries

    DELPHOS> display title "rhodopsin"
    DELPHOS> display title "bovine"

are not good enough for what you need. To find the sequence you want you'd probably have to correlate the results from both queries. DELPHOS provides the means to look for bovine rhodopsin in one go using compound parameters. Type

    DELPHOS> display title "rhodopsin bovine"

What you get back in the worklist are only those entries which have BOTH the words `rhodopsin' and `bovine' in the title line. The order of the parameters is unimportant; type

    DELPHOS> display title "bovine rhodopsin"

and you'll get the same result. Note that if you'd typed

    DELPHOS> display title "bovinerhodopsin"

or abbreviated it to

    DELPHOS> display title "vinerhod"

then you'd miss those entries which contained the word `rhodopsin' before the word `bovine'. Also note that, in the spirit of free text searching,

    DELPHOS> display title "ovin hodops"

would work equally well and would take less time to find the entries; after all, there is less to search for.

You can use compound parameters with the `seq' and `text' functions as well. They have the same meaning as for the `title' function i.e.
DELPHOS> display seq "vpfsn tetsq" DELPHOS> display text "opsin hogness" would, in the first example, find only those proteins which contained both val-pro-phe-ser-asn and thr-glu-thr-ser-gln in the same sequence. The second example would find only those proteins which contained both the words `opsin' and `hogness' within the same entry. Multiple parameters when used with the `code' function have a different meaning e.g. try DELPHOS> display code "opsd_human oobo" It would be meaningless to find those entries which had both the pcode `opsd_human' and the pcode `oobo', this would be a paradox as all pcodes are unique! Instead, what happened was that DELPHOS put BOTH the entries in the worklist. Multiple parameters to `code' therefore do what you'd intuitively expect. `Fseq' is the only exception as far as multiple parameters are concerned. YOU CANNOT USE MULTIPLE PARAMETERS WITH FSEQ. At this point we can reintroduce the `info' qualifier to the display command. You may want a running commentary on the hits DELPHOS finds for each parameter in a multiple parameter query. Try typing DELPHOS> display/info seq "vpfsn tetsq" (fig 17) Figure No.17 Matches for SEQ probe VPFSN are: No. of matches = 4 OOBO 11 MNGTEGPNFY VPFSN KTGVVRSPFEAPQYYL COX2_PARLI 214 EICGANHSFMPILIES VPFSN FENWVAQYIEE OPSD_HUMAN 11 MNGTEGPNFY VPFSN ATGVVRSPFEYPQYYL OPSD_MOUSE 11 MNGTEGPNFY VPFSN VTGVGRSPFEQPQYYL Matches for SEQ probe TETSQ are: No. of matches = 3 OOBO 340 GKNPLGDDEASTTVSK TETSQ VAPA OPSD_HUMAN 340 GKNPLGDDEASATVSK TETSQ VAPA OPSD_MOUSE 340 GKNPLGDDDASATASK TETSQ VAPA WORKLIST ENTRIES (3): OOBO Rhodopsin - Bovine OPSD_HUMAN RHODOPSIN. - HUMAN (HOMO SAPIENS). OPSD_MOUSE RHODOPSIN. - MOUSE (MUS MUSCULUS). DELPHOS will show all the database proteins which contained `vpfsn' and all the proteins which contained `tetsq', only then will it show the worklist (as usual) which contains the entries possessing both pentapeptides. N.B. 
DELPHOS only saves the worklist obtained from a multiple parameter query with the `info' qualifier and not the intermediate results. If you want to keep a record of the intermediate results use the `output' qualifier to save all the displayed information to a disc file.

2.9 Complex queries

By now you should be able to give DELPHOS any simple query with single or multiple parameters and display all or part of the database information by using qualifiers. You now know how to redirect this information to a disc file or to a printer. Biological queries, however, are usually not as cut-and-dried as the examples given above. Multiple parameters allow you to relate sequence information to other sequence information and text information to other text information. However, multiple parameters do not allow you to relate sequence information to text information. This is one of the reasons why DELPHOS allows complex queries. As an example type

    DELPHOS> display seq "vpfsn" and text "opsin" (fig 18)

Figure No.18

    WORKLIST ENTRIES (3):
    OOBO        Rhodopsin - Bovine
    OPSD_HUMAN  RHODOPSIN. - HUMAN (HOMO SAPIENS).
    OPSD_MOUSE  RHODOPSIN. - MOUSE (MUS MUSCULUS).

This query will put in the worklist only those protein entries which contain BOTH the sequence `val-pro-phe-ser-asn' AND the text `opsin'; one or the other will not do. In other words, the above query allows sequence/text correlations.

We now have to introduce the concept of the `operator'. Operators are words such as `and' which join parts of a complex query together. Let's give another example using the `or' operator. Type

    DELPHOS> display seq "vpfsn" or text "opsin" (fig 19)

Figure No.19

    WORKLIST ENTRIES (16):
    OOBO        Rhodopsin - Bovine
    OOFF        Rhodopsin - Fruit fly
    OOFF2       Opsin 2 - Fruit fly
    OOHUB       Blue-sensitive opsin - Human
    OOHUR       Red-sensitive opsin - Human
    OOHUG       Green-sensitive opsin - Human
    QRHYB2      Beta-2-adrenergic receptor - Hamster
    COX2_PARLI  CYTOCHROME C OXIDASE POLYPEPTIDE II (EC 1.9.3.1) (GENE NAME: COII) .
- SEA URCHIN (PARACENTROTUS LIVIDUS). GBT1_BOVIN GUANINE NUCLEOTIDE-BINDING PROTEIN G(T), ALPHA-1 SUBUNIT ( TRANSDUCIN ALPHA-1 CHAIN). - BOVINE (BOS TAURUS). GBT1_HUMAN GUANINE NUCLEOTIDE-BINDING PROTEIN G(T), ALPHA-1 SUBUNIT ( TRANSDUCIN ALPHA-1 CHAIN) (GENE NAME: GNAT1). - HUMAN (HOMO SAPIENS). GBT2_BOVIN GUANINE NUCLEOTIDE-BINDING PROTEIN G(T), ALPHA-2 SUBUNIT ( TRANSDUCIN ALPHA-2 CHAIN). - BOVINE (BOS TAURUS). OPS3_DROME OPSIN RH3 (INNER R7 PHOTORECEPTOR CELLS OPSIN) (GENE NAME: RH3 OR RH92CD). - FRUIT FLY (DROSOPHILA MELANOGASTER). OPS4_DROME OPSIN RH4 (INNER R7 PHOTORECEPTOR CELLS OPSIN) (GENE NAME: RH4). - FRUIT FLY (DROSOPHILA MELANOGASTER). OPSD_HUMAN RHODOPSIN. - HUMAN (HOMO SAPIENS). OPSD_MOUSE RHODOPSIN. - MOUSE (MUS MUSCULUS). OPSD_OCTDO RHODOPSIN. - GIANT OCTOPUS (OCTOPUS DOFLEINI). This query will retrieve the protein entries which contain EITHER the peptide `vpfsn' or the word `opsin' or BOTH. That is, any entry which contains one or the other or both gets put in the worklist. Note how the `or' operator differs from the `and' operator. Just to make it clearer we have provided an operator called `add' which does just the same as `or'. For example DELPHOS> display seq "vpfsn" add text "opsin" is precisely equivalent to the preceding query. You can relate any function to any other function using these operators. Other operators available to you include `xor' which stands for `exclusive or'. The query DELPHOS> display seq "vpfsn" xor seq "tetsq" (fig 20) Figure No.20 WORKLIST ENTRIES (4): OOBO Rhodopsin - Bovine COX2_PARLI CYTOCHROME C OXIDASE POLYPEPTIDE II (EC 1.9.3.1) (GENE NAME: COII) . - SEA URCHIN (PARACENTROTUS LIVIDUS). OPSD_HUMAN RHODOPSIN. - HUMAN (HOMO SAPIENS). OPSD_MOUSE RHODOPSIN. - MOUSE (MUS MUSCULUS). will put in the worklist those sequences which contain either `vpfsn' or `tetsq' BUT NOT THOSE WHICH CONTAIN BOTH. Another operator is `subtract'. What this does is to subtract the results of one query from the results of another. 
For example, the query

    DELPHOS> display seq "vpfsn" subtract text "opsin" (fig 21)

Figure No.21

    WORKLIST ENTRIES (1):
    COX2_PARLI  CYTOCHROME C OXIDASE POLYPEPTIDE II (EC 1.9.3.1) (GENE NAME: COII) . - SEA URCHIN (PARACENTROTUS LIVIDUS).

will put into the worklist only those proteins which contain the peptide `vpfsn' and contain no mention of `opsin' within the text. You can think of `subtract' as being equivalent to `and not'; therefore the query

    DELPHOS> display seq "vpfsn" and not text "opsin"

is completely equivalent to the preceding example but is arguably more difficult to understand. You can add the `not' operator after any other operator. For example

    DELPHOS> display seq "vpfsn" or not seq "tetsq"

will put in the worklist any sequence which contains `vpfsn' and also any sequence which doesn't contain `tetsq'. Re-read this section and make sure you understand it before proceeding.

2.10 Very complex queries

The complex query is not the limit of DELPHOS; it allows very complex queries as well. Very complex queries can be defined as those with more than two functions. Using functions and operators you can make any arbitrarily complex query. For very complex queries you can make the meaning clear by adding parentheses! Type the following query

    DELPHOS> display (seq "vpfsn" and seq "tetsq") or text "hogness"

What does it do? Well, this example is relatively easy and does precisely what you'd expect. The term in parentheses finds those entries which contain BOTH peptides, the term outside the parentheses finds all entries containing `hogness', and the sum of both terms then forms the worklist. To put it another way, the worklist will contain all the `hogness' entries plus those sequences with both `vpfsn' and `tetsq' in them. What about...

    DELPHOS> display seq "vpfsn" and (seq "tetsq" or text "hogness")

... this is obviously a different kettle of fish but again, if you look closely, it also does what you'd expect.
First, it puts together those proteins which contain either the peptide `tetsq' or the name `hogness' or both; it then selects from this group only those sequences which contain the peptide `vpfsn' and puts them in the worklist.

This raises the question... what does the following query do? Try it

    DELPHOS> display seq "vpfsn" and seq "tetsq" or text "hogness"

It could do one or the other of the last two examples. It is actually equivalent to

    DELPHOS> display seq "vpfsn" and (seq "tetsq" or text "hogness")

This is an important point about DELPHOS. If you don't put parentheses round terms in a very complex query DELPHOS works things out from right to left. It is good practice to use parentheses to make the meaning of a query entirely clear. You can put in as many as you like providing they balance. For example

    DELPHOS> display (seq "vpfsn") and ((seq "tetsq" or text "hogness"))

is perfectly acceptable and is equivalent to the last example. DELPHOS allows you to nest parenthesised queries to any depth. In practice though you can make things easier on yourself by breaking up the query into manageable chunks and fitting everything together using the DELPHOS list commands.

2.11 Complex queries made easy using lists.

`Display' is only one of many DELPHOS commands. It is the one most used for browsing through the database. Some other commands deal with lists. DELPHOS lists make life easy for you. There are two lists available to the DELPHOS user. One you've already met, the WORKLIST. The worklist, as its name implies, contains the set of proteins you're currently working on; typically this will be the results of the last query but not necessarily so. This is because of the existence of the STORELIST. This list is just what it says, a list which can act as a temporary store of a set of protein entries. The WORKLIST and the STORELIST can hold a set of protein pcodes (the unique identifiers of the database proteins).
You can transfer information from one list to another in several ways. Type DELPHOS> display text "opsin" DELPHOS> storework DELPHOS> display code "xxx" What this has done is to put all the opsins in the WORKLIST using the `display' command. The `storework' command copied the contents of the WORKLIST to the STORELIST overwriting anything that was there before (if anything). The final `display' command looked for a pcode which doesn't exist. This leaves you with no entries in the WORKLIST, you can verify this by typing DELPHOS> display The `display' command works exclusively on the WORKLIST. You can recover the list of opsins, currently held in the STORELIST, by typing DELPHOS> recallwork DELPHOS> display The `recallwork' command copies the contents of the STORELIST to the WORKLIST overwriting what was there before (in this case nothing) and then the `display' command shows you what is in the WORKLIST. At this moment the WORKLIST and STORELIST contain precisely the same set of pcodes. Now type DELPHOS> display seq "vpfsn" The WORKLIST now contains all the protein entries which contain the pentapeptide "vpfsn", the STORELIST contains all the opsins. You can interchange the contents of the two lists by typing DELPHOS> swapwork then type DELPHOS> display to verify the interchange. Type DELPHOS> swapwork DELPHOS> display and you're back where you started. You can also save the contents of the WORKLIST or STORELIST to a disc file and load it back in again later. The commands to use are `worksave', `workread', `storesave' and `storeread'. Type DELPHOS> display title "rhodopsin" DELPHOS> worksave work.tmp DELPHOS> display title "cytochrome" DELPHOS> storework DELPHOS> display/full code "oobo" DELPHOS> workread work.tmp DELPHOS> display You end up with the rhodopsins in the WORKLIST and the cytochromes in the STORELIST after having had a quick look at pcode `oobo'! 
Note that if you don't give the list save and read commands the name of a disc file (work.tmp in the last example) they will prompt you for one. Having described the lists we can now explain how they can help you break down complex queries into manageable chunks. Take as an example the query we used earlier. DELPHOS> display (seq "vpfsn" and seq "tetsq") or text "hogness" First of all, you can type DELPHOS> display seq "vpfsn" and seq "tetsq" DELPHOS> storework This takes the first part of the query (the section in parentheses), works it out and puts it as usual into the WORKLIST. The second command makes a copy of the WORKLIST in the STORELIST. Now you're ready to type DELPHOS> display text "hogness" So now you've got the first part of the query in the STORELIST and the last part of the complex query in the WORKLIST. All you need do now is to `or' them. You do this by typing DELPHOS> orlists This command performs the `or' of the STORELIST with the WORKLIST and leaves the result in the WORKLIST. The STORELIST remains unchanged. You can now examine the WORKLIST by typing DELPHOS> display The other list operator commands are `andlists' and `xorlists'. They both, like `orlists', put the result in the WORKLIST and leave the STORELIST unchanged. So, for example, the query DELPHOS> display seq "vpfsn" and (seq "tetsq" or text "hogness") can be broken down into the following steps. DELPHOS> display seq "tetsq" or text "hogness" DELPHOS> storework DELPHOS> display seq "vpfsn" DELPHOS> andlists DELPHOS> display Another useful command is `negwork'. This command performs a `not' on the WORKLIST i.e. it replaces whatever was in the WORKLIST with whatever wasn't! For example, if you wished, you could emulate the query DELPHOS> display not text "cytochrome" by typing DELPHOS> display text "cytochrome" DELPHOS> negwork DELPHOS> display To summarise, using the DELPHOS lists and list operations you can break down any arbitrarily complex query into small chunks. 
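The list commands behave like ordinary set operations on pcodes. The following is a minimal Python sketch of that idea, not a description of how DELPHOS is implemented: the class `Lists`, its attributes, and the toy database are assumptions made for illustration, and only the method names are borrowed from the DELPHOS commands. The `vpfsn' and `tetsq' hit sets are taken from figure 17; the `hogness' hit set is a guess (OOBO is the only entry shown in this tutorial with Hogness as an author).

```python
class Lists:
    """Toy model of the WORKLIST and STORELIST as Python sets of pcodes."""

    def __init__(self, database):
        self.database = set(database)  # every pcode in the toy database
        self.work = set()              # WORKLIST
        self.store = set()             # STORELIST

    def storework(self):
        self.store = set(self.work)    # copy WORKLIST -> STORELIST

    def recallwork(self):
        self.work = set(self.store)    # copy STORELIST -> WORKLIST

    def swapwork(self):
        self.work, self.store = self.store, self.work

    # The list operators leave the STORELIST unchanged.
    def orlists(self):
        self.work = self.store | self.work

    def andlists(self):
        self.work = self.store & self.work

    def xorlists(self):
        self.work = self.store ^ self.work

    def negwork(self):
        self.work = self.database - self.work  # everything that wasn't there


# Hit sets: VPFSN and TETSQ follow figure 17; HOGNESS is an assumption.
VPFSN = {"OOBO", "COX2_PARLI", "OPSD_HUMAN", "OPSD_MOUSE"}
TETSQ = {"OOBO", "OPSD_HUMAN", "OPSD_MOUSE"}
HOGNESS = {"OOBO"}
DATABASE = VPFSN | {"CCHU", "OOFF"}  # a few extra pcodes for negwork

# Emulate: display seq "vpfsn" and (seq "tetsq" or text "hogness")
lists = Lists(DATABASE)
lists.work = TETSQ | HOGNESS   # display seq "tetsq" or text "hogness"
lists.storework()              # storework
lists.work = set(VPFSN)        # display seq "vpfsn"
lists.andlists()               # andlists
print(sorted(lists.work))      # display
```

Modelling the lists as sets makes it easy to see why `orlists', `andlists' and `xorlists' do not depend on which half of the query was stored first: union, intersection and symmetric difference are all commutative.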
You can also save either or both lists to disc, leave DELPHOS, do something else, return to DELPHOS, load back the lists from disc and carry on where you left off.

2.12 Shortcuts

DELPHOS allows you to abbreviate commands and qualifiers down to the point of no ambiguity with other commands or qualifiers. For example

    DELPHOS> swapwork

can be replaced by

    DELPHOS> sw

It cannot be replaced by simply `s' as this could be confused with `storework', `storesave' and `storeread' and you'd be given a rude message. Similarly, the query

    DELPHOS> display/author/comment/output=a.a title "opsin"

can be replaced by

    DELPHOS> d/au/c/o=a.a title "opsin"

as there is no conflict with any other command or qualifier. Remember though that YOU CANNOT ABBREVIATE FUNCTIONS (e.g. `seq') OR OPERATORS (e.g. `and').

The command `display' holds a privileged place in DELPHOS. If there is no other command given and there is no ambiguity anywhere else then DELPHOS assumes the `display' command has been given, so the following queries are all equivalent.

    DELPHOS> display/author/comment/output=a.a title "opsin"
    DELPHOS> d/au/c/o=a.a title "opsin"
    DELPHOS> /au/c/o=a.a title "opsin"

As are

    DELPHOS> display seq "vpfsn"
    DELPHOS> seq "vpfsn"

If you're at the DELPHOS> prompt though, to redisplay the worklist you have to type at least

    DELPHOS> d

Another useful facility is the ability to execute an operating system command from within DELPHOS. You do this by preceding the command with a dollar (`$') symbol. For example,

    DELPHOS> $directory (VMS)
    DELPHOS> $ls (UNIX)

will list the current directory. If you just type

    DELPHOS> $

then a subprocess will be created and you'll be returned to the operating system. You can type commands as normal then, when you've finished, `logout' and you'll be returned to DELPHOS.

For the sake of clarity, the rest of this documentation does not use these abbreviations. It is nevertheless extremely useful to remember these shortcuts as they save a lot of finger-ache.
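The abbreviation rule amounts to unique-prefix matching. A minimal Python sketch of such a resolver follows, assuming that an exact match wins and that any other prefix must select exactly one name; the function `resolve` and the error messages are hypothetical, and the real DELPHOS parser may differ in detail. The command names are those listed in the command summary of section 3.1.

```python
COMMANDS = [
    "display", "createdb", "createseq", "storework", "recallwork",
    "swapwork", "worksave", "workread", "storesave", "storeread",
    "hitwork", "hitstore", "andlists", "orlists", "xorlists",
    "pluswork", "minuswork", "negwork", "not", "copy", "help",
    "quit", "exit", "bye",
]


def resolve(abbrev, names=COMMANDS):
    """Return the one name that `abbrev` abbreviates, or raise ValueError."""
    abbrev = abbrev.lower()
    if abbrev in names:              # a full name is never ambiguous
        return abbrev
    matches = [name for name in names if name.startswith(abbrev)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"unknown command: {abbrev!r}")
    raise ValueError(f"ambiguous abbreviation {abbrev!r}: {', '.join(matches)}")


print(resolve("sw"))   # the only command beginning with "sw"
print(resolve("d"))    # the only command beginning with "d"
try:
    resolve("s")       # several commands begin with "s"
except ValueError as err:
    print(err)
```

The same function could be applied to the list of qualifier names to explain why `au' is needed for `author': a bare `a' would also match `alternative'.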
Try using these shortcuts in the remainder of the tutorial.

2.13 Tailoring the worklist

Occasionally a query will give a false positive result, in which case it is advantageous to be able to remove an entry, or set of entries, from the worklist. You can do this in DELPHOS using the `minuswork' command. This command can accept a query just like `display'. Assume for example you've got all the opsins by typing

    DELPHOS> display title "opsin"

but then remember you wanted all the opsins that don't contain the sequence `vpfsn'. All you need to type is

    DELPHOS> minuswork seq "vpfsn"

As another example, assume you've got all the opsins using the query

    DELPHOS> display title "opsin"

but then remember that you didn't want the one with pcode `ooff'. You can just type

    DELPHOS> minuswork code "ooff"

The command `pluswork' has the opposite effect to `minuswork': it adds the results of a query to the worklist. For example,

    DELPHOS> pluswork code "oobo ooff"

will add the two pcodes to the worklist, assuming they're not already there.

2.14 Creating database subsets

Typically a researcher is interested in a related group of proteins. If a new sequence comes along and, for example, a similarity search of this sequence against a predefined set of other proteins is wanted then it does not make sense, either scientifically or with regard to computer time, to compare the sequence against the whole database. In cases like these it is advantageous to create a database subset and do the similarity search against the subset. The examples given in the tutorial so far have shown how easy it is to create a worklist containing sequences of interest. DELPHOS makes it easy to extract these sequences onto disc by providing the `createseq' and `createdb' commands. `Createseq' is used for extracting just the sequences into a file; `createdb' creates a sequence, reference and title file in NBRF-PIR format.
`Createdb' will be infrequently required by a user; `createseq' is the command to use to create a database subset for similarity searching. Both of the `create' commands can accept a query, just like `display'. If no query is given then the contents of the worklist will form the database. Either way you are prompted for the name to give the disc file(s). The following commands produce the same effect:

DELPHOS> createseq title "opsin"
Enter a name for the sequence file: opsin.seq

or

DELPHOS> display title "opsin"
DELPHOS> createseq
Enter a name for the sequence file: opsin.seq

Both of the above operations will create the file `opsin.seq', which contains the sequences of all proteins which had the word `opsin' in their title line. This database is in the correct format for the ISIS similarity searching program SWEEP.

DELPHOS also provides the `copy' command. This outputs the contents of the worklist to a disc file in the NBRF-PIR PSQ COPY format.

2.15 Reading SWEEP hit lists

DELPHOS can also read hit lists produced by the SERPENT similarity searching program SWEEP. Refer to the SERPENT documentation for a description of SWEEP. DELPHOS loads the pcodes from any specified block in the hit list file. The unsegmented sequence block produced by SWEEP is referred to as `block 1', the first segmented sequence block is referred to as `block 2' etc. These blocks can be read into either the worklist or the storelist using the `hitwork' and `hitstore' commands respectively. These commands ask for the name of the SWEEP hits file and the block number of interest.

3.
A summary of DELPHOS commands, qualifiers, functions and operators

3.1 Commands

Command           Function
DISPLAY (Q,W)     Query Result -> standard output & worklist
CREATEDB (Q,W)    Worklist -> NBRF-PIR .seq .ref and .ttl files
CREATESEQ (Q,W)   Worklist -> NBRF-PIR .seq file
STOREWORK         Worklist -> Storelist
RECALLWORK        Storelist -> Worklist
SWAPWORK          Transposes Storelist and Worklist
WORKSAVE (D)      Worklist -> Disc file
WORKREAD (D)      Disc file -> Worklist
STORESAVE (D)     Storelist -> Disc file
STOREREAD (D)     Disc file -> Storelist
HITWORK (D)       SWEEP Hit-list -> Worklist
HITSTORE (D)      SWEEP Hit-list -> Storelist
ANDLISTS          Storelist AND Worklist -> Worklist
ORLISTS           Storelist OR Worklist -> Worklist
XORLISTS          Storelist XOR Worklist -> Worklist
PLUSWORK (Q)      Worklist OR Query Result -> Worklist
MINUSWORK (Q)     Worklist AND NOT Query Result -> Worklist
NEGWORK           NOT Worklist -> Worklist
NOT               NOT Query -> Worklist
COPY (Q,W)        Worklist -> NBRF-PIR Copy-format disc file
HELP              Brief help -> standard output
QUIT/EXIT/BYE     Return to operating system

Key:
Q = accepts query
W = accepts current worklist contents (see tutorial)
D = accepts data

The DELPHOS commands are summarised above. There follows a more detailed description of each.

DISPLAY: This is the default command. Its use is assumed if no other command is given. This command may accept a query. The default output of this command is to send only the title line of each matching protein in the worklist to the screen. The output can be expanded and redirected using the qualifiers presented in the next table. If no query is given then the current contents of the worklist are redisplayed.

CREATEDB: This command may accept a query. The contents of the worklist are used to create an NBRF-PIR format database of SEQ, REF and TTL files. If no query is given then the current contents of the worklist are used. The command prompts for a name for the database files.

CREATESEQ: Similar to CREATEDB but only the NBRF-PIR SEQ file is produced.
STOREWORK: Overwrites the storelist with the current worklist. The worklist is unaffected.

RECALLWORK: Overwrites the worklist with the storelist contents. The storelist is unaffected.

SWAPWORK: Transposes the worklist and storelist entries.

WORKSAVE: Saves the worklist, as protein identification codes (pcodes), to a named file. This command expects a filename but will prompt if one is not given.

WORKREAD: Reads a file of pcodes into the worklist.

STORESAVE: As for WORKSAVE but acts on the storelist.

STOREREAD: As for WORKREAD but acts on the storelist.

HITWORK: Loads a block of pcodes from a SWEEP hitlist into the worklist. Expects a filename and a block number but will prompt if either or both are missing. The SWEEP unsegmented sequence block is `block 1', the first segmented sequence block is `block 2' etc.

HITSTORE: As for HITWORK but acts on the storelist.

ANDLISTS: Performs the boolean AND of the storelist and the worklist, leaving the result in the worklist. The storelist is unaffected. Only those pcodes common to both lists form the new worklist.

ORLISTS: As for ANDLISTS but the boolean OR is performed. The pcodes in both lists are combined to form the new worklist.

XORLISTS: As for ANDLISTS but the boolean eXclusive OR is performed. Only pcodes that were in one or other list, but not both, form the new worklist.

NEGWORK: The worklist entries are replaced by all the other entries in the database.

PLUSWORK: This command expects a query and will prompt if none is given. The results of the query are added to the current contents of the worklist.

MINUSWORK: This command expects a query and will prompt if none is given. The results of the query are subtracted from the current contents of the worklist.

COPY: Emulates the NBRF-PIR PSQ COPY command. You are asked whether text information is required.
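The list commands described above behave like ordinary set operations on pcodes. The following Python sketch models them that way. It is an illustration under assumptions rather than a description of how DELPHOS stores its lists: the function names simply mirror the command names, and the toy database is hypothetical (only the pcodes `ooff' and `oobo' appear in the tutorial).

```python
# Hypothetical toy database of pcodes; "ooxx" and "ooyy" are invented.
DATABASE = frozenset({"ooff", "oobo", "ooxx", "ooyy"})


def andlists(work, store):
    """ANDLISTS: pcodes common to both lists."""
    return work & store


def orlists(work, store):
    """ORLISTS: pcodes present in either list."""
    return work | store


def xorlists(work, store):
    """XORLISTS: pcodes in one list or the other, but not both."""
    return work ^ store


def negwork(work, database=DATABASE):
    """NEGWORK: every database entry that is not in the worklist."""
    return set(database) - work


def pluswork(work, query_result):
    """PLUSWORK: worklist OR query result."""
    return work | query_result


def minuswork(work, query_result):
    """MINUSWORK: worklist AND NOT query result."""
    return work - query_result


if __name__ == "__main__":
    work = {"ooff", "oobo"}
    store = {"oobo", "ooxx"}
    print(sorted(andlists(work, store)))
    print(sorted(orlists(work, store)))
    print(sorted(xorlists(work, store)))
    print(sorted(negwork(work)))
    print(sorted(minuswork(work, {"ooff"})))
```

In each case the result would replace the worklist and the storelist would be left unchanged, as stated in the command descriptions.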
3.2 Qualifiers

These are only available for use with the DISPLAY command.

/AUTHOR            Enables author information display
/ALTERNATIVE       Enables alternative name display
/COMMENT           Enables comment display
/INFO              Displays results of subqueries
/FULL              Equivalent to /AUTHOR/ALTERNATIVE/COMMENT/SEQUENCE
/OUTPUT=filename   Sends screen information to a named file
/PRINTER           Sends screen information to the default printer
/SEQUENCE          Enables protein sequence display

3.3 Functions

NOT
TITLE
TEXT
SEQ
CODE
FSEQ

SEQ: Searches the database for any sequences which exactly match a given peptide. If multiple parameters are given (e.g. seq "xxx yyy") then only those sequences which contain all the peptides specified are returned.

TITLE: Searches the database titles for entries which contain the given text. Multiple parameters have the same interpretation as for the SEQ function.

TEXT: Searches all textual information in the database for entries which contain a given string. Multiple parameters (e.g. text "xxx yyy") have the same interpretation as for the SEQ function.

CODE: Searches the database for given pcodes. Multiple pcodes can be given.

FSEQ: Searches the database for fuzzy sequence matches. The parameter takes the form "probe mismatches". The 'probe' term contains the sequence, with parentheses enclosing invariant residues. The 'mismatches' term is an integer giving the number of allowed mismatches in the sequence. For example, FSEQ "a(c)def 1" will find those entries containing the above pentapeptide with one allowed mismatch in residues 'a', 'd', 'e' or 'f' (a 'c' must always be the second relative residue). Multiple parameters are not currently implemented for this function.

3.4 Operators

AND
OR
XOR
ADD (equivalence = 'or')
SUBTRACT (equivalence = 'and not')

4. A strategy for creating database subsets from text queries

This section describes the development, using DELPHOS, of an effective strategy for a typical problem of retrieval. The task is to...
"Create a specialised database of all Class I and Class II Major Histocompatibility antigens, including any Tla/Qa or CD1 homologues."

The objective is to define a set of DELPHOS commands that will allow such a database to be built FROM FUTURE OWL RELEASES without the need for database similarity searching at every new release. This task requires an initial stage of research into the retrieval power of both text parameters and sequence similarity searches. Once a strategy has been developed it can be applied very quickly to update the specialised subset database following each new release of OWL.

This mini-database would have many research applications, for example in the construction of multiple sequence alignments and phylogenies. Detailed modelling of the structure and interactions of important parts of this set of homologous molecules, such as the peptide presentation groove, requires as much information as possible from sequence alignments about alternative amino acids at different positions in the structure and their functional effects. Sequence pattern discriminators could be more readily refined by rapid evaluation using the small subset of the OWL database. The information in the specialised database could also be readily extended, using the `browsing' facilities of DELPHOS, to include, for example, other MHC-encoded polypeptides or proteins that interact with MHC Class I and Class II antigens.

Particular problems arise in this case in achieving complete and specific retrieval of only the relevant protein entries. A few hundred of the Class I and Class II antigens have been sequenced. Retrieval by sequence homology is likely to be relatively non-specific because of the many other homologues of individual domains of these antigens, for example immunoglobulins, beta-2-microglobulins and other immune system surface receptors.
Retrieval by text strings is likely to be incomplete because of the numerous synonyms of MHC antigen names present in the component databases of OWL. These problems can be overcome by (a) using connected combinations of many text search parameters and (b) using the list manipulations of DELPHOS to integrate information from both sequence similarity searches and text retrieval.

4.1 Gather information from sequence similarity searches

A good approach is to run a typical Class I and a typical Class II protein through the SWEEP similarity searching program and gather the top 500 or so preliminary alignments. In this example more than 250 homologous Class I and Class II proteins will be present in the hit lists. A few other homologues of the immunoglobulin domain will also appear in the hit lists; with OWL 7.0 these predominated from position 270 downwards. Load the two lists into DELPHOS using the `hitstore' and `hitwork' commands and intersect them using `andlists' to check homology between the two hit lists. The two lists can then be combined using `orlists'. Both the intersected and combined lists can be saved to disc.

4.2 Work out effective simple text discriminators for the homologues found by sequence similarity searching

In the given example, it can be shown by experiment that many relevant text strings such as "mhc", "hla", "classi", "transplantation" and "antigen" show little specificity for the MHC proteins. For example, "mhc" is also the gene designation for "myosin heavy chain" and "hla" is a frequent substring of irrelevant words such as "chlamydia". Greater retrieval efficiency was obtained with character strings from the word "histocompatibility": only a few relevant entries (13 in OWL 7.0) did not contain a version of this word and, again, only a few unwanted entries (25 in OWL 7.0) were retrieved.
The truncated string "histocompat" gave more efficient retrieval than the full word because of the occurrence of the mis-spelled word "histocompatability" in some database entries. The words "histoco" and "istocom" gave identical results to "histocompat" but shorter strings were less discriminating.

The diagnostic success of "histocompatibility", compared to other words, is a consequence of its appearance as a standard term in one or more textual fields of the NBRF, Swissprot and GenBank source databases. The concordance between different databases may be fortuitous: "histocompatibility" is used by NBRF in both titles and keywords, by Swissprot as a standard term in the feature tables and by GenBank as a keyword in most of the relevant entries. GenBank is internally inconsistent because some of the author-submitted entries lack the term. Such results demonstrate why a standard vocabulary ought to be adopted by source databases and also why DELPHOS free text searching scores over other query languages. The discriminatory power of "histocompat" DEPENDS on the free text ability since the keyword is not restricted to a common field, or to any field, in the source databases.

4.3 See if there are more complex text discriminators which give better retrieval

In the given example, although

DELPHOS> display text "histocompat"

gave 90% recall, it was useful to try to achieve better retrieval using more complex queries to avoid source database problems of synonymy and inconsistent usage. The following alternative designations were commonly present in the titles of some of the Class I and Class II proteins: "Class I histocompatibility", "MHC class I", "MHC class II", "Class I MHC", "Class II MHC", "HLA class I", "HLA class II", "HLA-DR class II", "RLA class I", "RT1 class II", "H-2 class I" and "H-2 class II".
There were also less frequent variations such as "Qa/Tla class I", "MHC HLA DQ", "H-2 L-D gene product", "Q7b antigen", "T1 antigen", "CD1 thymocyte" and "CD1 histocompatibility". These many synonyms, which contain key substrings in different sequences, can be diagnosed only with a relatively complex query. The example below was formulated as follows. First, the string "classi" was used since this was common to many of the synonyms. Second, the less frequent names were reduced to their shortest effective substrings. The most concise form of a DELPHOS expression to retrieve the great majority of the required proteins would be:

DELPHOS> display (text "classi" and (text "mhc" or text "hla" or text "rt1" or text "h2c" or text "histocompat")) or (text "cd1" and (text "thymo" or text "histocompat")) or text "q7b" or text "tlantigen"

Although theoretically very efficient in CPU time, this query should be subdivided for clarity (as suggested in the tutorial) and possibly because of computer memory constraints, as the multiple `or' operations generate very large lists. The above query is far more readable as:

DELPHOS> display text "classi mhc" or text "classi hla" or text "classi rt1"
DELPHOS> pluswork text "classi h2c" or text "classi histocompat"
DELPHOS> pluswork text "cd1 thymo" or text "cd1 histocompat"
DELPHOS> pluswork text "q7b" or text "tlantigen"

With OWL 7.0 this took only 23.1 seconds of cpu time on a MicroVAX 3600 computer. Retrospective analysis showed that recall was almost complete, with only 3 out of 299 relevant entries being missed. A few irrelevant proteins designated as MHC "Class III" had also been retrieved. Although MHC-encoded, these are complement proteins and do not belong to the Class I and Class II set specified. They were removed using the command

DELPHOS> minuswork text "classiiig" or text "classiiih" or text "classiiir" or text "glycoproteincd4"

This procedure took only 13.6 seconds of cpu time on the same computer.
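The equivalence between the compact query and its subdivided form rests on the fact that a multi-parameter probe such as text "classi mhc" returns only entries containing both strings, so `and' distributes over `or'. The Python sketch below illustrates this on a handful of invented entries. It is not DELPHOS code: the entries and pcodes are hypothetical, and the normalisation used for matching (lower case, with punctuation and spaces removed) is an assumption inferred from search strings such as "classi" and "glycoproteincd4".

```python
import re

# Hypothetical toy entries (pcode -> text); not taken from OWL.
ENTRIES = {
    "p1": "MHC class I antigen",
    "p2": "HLA class II histocompatibility antigen",
    "p3": "CD1 thymocyte antigen",
    "p4": "Myosin heavy chain (MHC)",
    "p5": "Q7b antigen",
}


def normalise(string):
    """Assumed matching rule: ignore case, punctuation and spaces."""
    return re.sub(r"[^a-z0-9]", "", string.lower())


def text(parameters):
    """Return the pcodes whose text contains every given parameter."""
    terms = [normalise(term) for term in parameters.split()]
    return {
        pcode
        for pcode, entry in ENTRIES.items()
        if all(term in normalise(entry) for term in terms)
    }


# Compact form: one expression with nested 'and' / 'or'.
compact = (
    (text("classi") & (text("mhc") | text("hla") | text("histocompat")))
    | (text("cd1") & (text("thymo") | text("histocompat")))
    | text("q7b")
)

# Subdivided form: 'display' followed by successive 'pluswork' steps.
worklist = text("classi mhc") | text("classi hla")          # display
worklist |= text("classi histocompat")                      # pluswork
worklist |= text("cd1 thymo") | text("cd1 histocompat")     # pluswork
worklist |= text("q7b")                                     # pluswork

if __name__ == "__main__":
    print(sorted(compact))
    print(compact == worklist)
```

The myosin entry is matched by "mhc" alone but is excluded from both forms because it lacks "classi", which is the reason for combining the strings in the first place.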
4.3 Integrate the results from the sequence similarity search with those from the text retrieval

It is good practice to save the worklist as you go along. In the HLA example, the `andlists' and `orlists' worklists from the SWEEP search were held in the files HLA_AND.WK and HLA_OR.WK respectively. The worklist resulting from the complex text discriminators shown in the last section was held in MHC_TEXT.WK.

First, you can identify all the entries common to the two methods of retrieval by typing

DELPHOS> workread MHC_TEXT.WK
DELPHOS> storeread HLA_OR.WK
DELPHOS> andlists
DELPHOS> worksave MHC_AND.WK

In our case the result contained only 2 unwanted entries and missed only 24, giving good reassurance of the high concordance between the two independent methods of sequence similarity searching and text retrieval.

Second, identify the entries differing in the two lists. The entries retrieved by sequence similarity but not by text retrieval were obtained by typing

DELPHOS> storeread MHC_AND.WK
DELPHOS> workread HLA_OR.WK
DELPHOS> xorlists
DELPHOS> worksave HLA_XOR.WK

and then the entries obtained by text retrieval but not by similarity searching by typing

DELPHOS> workread MHC_TEXT.WK
DELPHOS> xorlists
DELPHOS> worksave MHC_XOR.WK

The resulting lists were valuable for completing the list of relevant entries and for identifying the remaining unwanted entries from the complex text discriminator list MHC_TEXT.WK. The relatively short lists derived from the XOR operations were quickly perused and new diagnostic character strings readily devised. Call the resulting worklist MHC.WK.

Some relevant entries might have been missed because of a combination of low sequence homology and anomalies in the text strings of the database annotations and titles. Such possibilities should be explored by `browsing' the database using DELPHOS.

4.4 Browse

An effective method of browsing is to search the OWL database with text or title strings that are exploratory rather than diagnostic.
Such strings may have relatively broad specificity for aspects of a range of related proteins or functions that might be cross-annotated in the textual fields of protein entries. In the context of MHC antigens, individual exploratory strings might include "mhc", "hla", "rla", "rt1" etc., or perhaps the names of likely authors such as "hoodl" (for "Hood, L."). To explore OWL for relevant entries not in our MHC.WK, these strings were used one at a time and the results of each search compared with MHC.WK by typing, for example,

DELPHOS> display text "mhc"
DELPHOS> storework
DELPHOS> workread MHC.WK
DELPHOS> xorlists
DELPHOS> andlists
DELPHOS> display

Only one more relevant entry was found, the diagnostic text of this Class I antigen being "CW1 antigen". This diagnostic text was added to the complex query and the result saved as the final list. This list was used to create the specialised database using the CREATEDB command.

4.5 Updating

Once a retrieval strategy has been established for the creation of a specialised database, it can be applied quickly and repeatedly to update this database with each new release of OWL. The research process, as described above for the MHC proteins, is only necessary for the initial development and evaluation of the retrieval strategy. Ease of updating is crucial since the amount of protein data is doubling every 18 months. For some purposes a simplified strategy such as

DELPHOS> createdb text "histocompat"

is sufficient, as it gave greater than 90% precision and recall for OWL 7.0 (January 1990). If completeness of retrieval is critical then more complex queries and expert assessment of the list entries, as described above, are required. Either way, DELPHOS queries typically take only a few seconds to perform on a VAX minicomputer and rarely more than a minute for very complex queries.

5.0 Theory

The context-free grammar of DELPHOS is defined by the replacement rules given below, which define the set of all regular DELPHOS expressions.
<expression> ::= <action> | <action> <query> | <action> <filename> | NULL
<action>     ::= <command> | <command> <qualifier>
<query>      ::= <probe> | <query> # <probe> | (<query>)
<probe>      ::= <function> <parameter>

Key.
Metasymbols            ::= 'is a'
                       |   'or'
Non-terminal symbols   <command> <qualifier> <function>
Terminal symbol        #   boolean operator

A DELPHOS expression consists of a command (with optional qualifiers) followed by a query. Queries are any arbitrarily complex associations of function/parameter pairs separated by boolean operators. Subqueries may be parenthesised to any depth in order to force precedence. Actions may be abbreviated unless such an abbreviation produces ambiguity with another action; functions and boolean operators may not be abbreviated.

Actions must be separated from queries by at least one space character. Similarly, probes must be separated from boolean operators, and functions separated from parameters, by at least one space character. All parameters to functions are delimited by double quotation marks. Typical regular expressions are:

DELPHOS is case-insensitive and ignores all punctuation in function parameters.

DELPHOS internally converts all regular expressions to postfix. The conversion to reverse-polish notation removes all parentheses and makes the query unambiguous. A corollary is that unparenthesised queries are parsed from right to left.

6.0 References

Pearson, W.R. and Lipman, D.J. (1988) PNAS USA 85, 2444-2448
George et al.