Glossary

This page provides a brief glossary of terms and abbreviations commonly encountered in bioinformatics. This glossary forms part of an online Guide to Molecular Sequence Analysis.

Some of these explanations are simplified for the sake of brevity. Please refer to molecular biology textbooks for more comprehensive details.

Alu

A family of approx. 300 bp repetitive sequences, found dispersed throughout the human genome. Almost any 100 kb human nucleotide sequence will have Alu sequences within it.

Base Analogue

A chemical compound sufficiently similar to one of the nitrogenous bases normally found in DNA that it can replace that base. Base analogues may cause mutations, or may be used in a modified PCR (e.g. when sequencing).

Bioinformatics

The discipline of obtaining information about genomic or protein sequence data. This may involve similarity searches of databases (comparing your unidentified sequence to the sequences in a database), or making predictions about the sequence based on current knowledge of similar sequences. Databases are frequently made publicly available through the Internet, or locally at your institution.

BLAST

A set of programs used to perform fast similarity searches. For example, BLASTN compares a nucleotide query sequence with the nucleotide sequences in a database, and BLASTP searches a protein database for matches to a query protein sequence. Complex statistics are applied to judge the significance of each match. Reported sequences may be homologous to, or otherwise related to, the query sequence. There are several other flavours of BLAST.

BLAST2

A newer release of BLAST. Allows for insertions or deletions in the sequences being aligned. Gapped alignments may be more biologically significant.

cDNA

Complementary DNA. DNA copies of the mRNA expressed in a specified tissue. cDNA sequencing has the advantage of only representing expressed genes. Since only ~3% of the vast quantity of DNA in the human genome is coding sequence, cDNA sequencing is particularly useful in certain situations. See EST.

CDS or cds

Coding sequence.

Clone

Population of identical cells or molecules (e.g. DNA), derived from a single ancestor.

Cloning Vector

A molecule that carries a foreign gene into a host, and allows/facilitates the multiplication of that gene in a host. When sequencing a gene that has been cloned using a cloning vector (rather than by PCR), care should be taken not to include the cloning vector sequence when performing similarity searches. Plasmids, cosmids, phagemids, YACs and PACs are example types of cloning vectors.

Consensus Sequence

A derived nucleotide sequence that represents a family of similar sequences. Each base in the consensus sequence corresponds to the base most frequently occurring at that position in the real sequences.
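
The rule above can be sketched in a few lines of Python. This is only an illustration, assuming the sequences have already been aligned to the same length; the function name consensus and the alphabetical tie-break are choices made for this example, not part of any standard tool.

```python
from collections import Counter

def consensus(aligned):
    """Return the most frequent base at each column of equal-length sequences."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to the same length")
    result = []
    for column in zip(*aligned):
        counts = Counter(column)
        # Assumption for this sketch: ties are broken alphabetically
        # so that the result is deterministic.
        best = max(sorted(counts), key=lambda base: counts[base])
        result.append(best)
    return "".join(result)

# Four short, pre-aligned example sequences (made up for illustration).
print(consensus(["TATAAT", "TATGAT", "TACAAT", "TATAAT"]))  # TATAAT
```

Real consensus builders usually also handle gaps and report ambiguity codes where no single base dominates; this sketch does neither.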

Contig

A DNA sequence that overlaps with another contig. The full set of overlapping sequences (contigs) can be put together to obtain the sequence for a long region of DNA that cannot be sequenced in one run in a sequencing assay. Important in genetic mapping at the molecular level.
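
As a rough illustration of how overlapping sequences are put together, the Python sketch below joins two reads when the end of one exactly matches the start of the other. The function name merge_overlap and the min_overlap parameter are hypothetical, and exact matching on one strand is a simplifying assumption; real assembly software tolerates sequencing errors and considers both strands.

```python
def merge_overlap(left, right, min_overlap=3):
    """Join two reads if a suffix of `left` equals a prefix of `right`.

    Returns the merged sequence, or None when no exact overlap of at
    least `min_overlap` bases exists.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest = min(len(left), len(right))
    # Try the longest possible overlap first, then shorter ones.
    for size in range(longest, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return None

# The reads share the four bases GTAC (made-up example data).
print(merge_overlap("ATGCGTAC", "GTACCTGA"))  # ATGCGTACCTGA
```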

DNA Sequencing

The experimental process of determining the nucleotide sequence of a region of DNA. This is done by labelling each nucleotide (A, C, G or T) with either a radioactive or fluorescent marker which identifies it. There are several methods of applying this technology, each with its own advantages and disadvantages. For more information, refer to a current textbook. High-throughput laboratories frequently use automated sequencers, which are capable of rapidly reading large numbers of templates. Sometimes, the sequences may be generated more quickly than they can be characterised.

Downstream

Toward the 3' end of a nucleotide sequence.

EMBL

European Molecular Biology Laboratory. Maintains the EMBL database, one of the major public sequence databases.

EMBnet

European Molecular Biology Network (http://www.embnet.org). Established in 1988, it provides services including local molecular databases and software for molecular biologists in Europe. There are several large outposts of EMBnet, including EXPASY.

EST

See Expressed Sequence Tag

Exon

Coding region of DNA. See CDS.

Expressed Sequence Tag (EST)

Randomly selected, partial cDNA sequence; represents its corresponding mRNA. dbEST is a large database of ESTs at GenBank, NCBI.

HGMP

Human Genome Mapping Project.
The UK HGMP Resource Centre is an academic institution in the UK which provides a number of services, including access to databases, mirrors of databases, and access to extensive services/software for registered academic users.

Intron

Non-coding region of DNA.

MMDB

Molecular Modelling Database. A taxonomy assigned database of PDB (see PDB) files, and related information.

NCBI

National Center for Biotechnology Information (USA). Created by the United States Congress in 1988, to develop information systems to support the biological research community.

NIH

National Institutes of Health (USA).

OMIM

Online Mendelian Inheritance in Man. Database of genetic diseases with references to molecular medicine, cell biology, biochemistry and clinical details of the diseases.

ORF

Open Reading Frame. A series of codons (base triplets) which can be translated into a protein. An unidentified nucleotide sequence has six potential reading frames: three on each strand. BLASTX (see BLAST) translates a nucleotide query sequence in all six reading frames, then attempts to align the resulting protein sequences to sequences in a protein database. The most likely reading frame can be identified using on-line software (e.g. ORF Finder).
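
The six reading frames can be enumerated with a short Python sketch. It is a simplified illustration, not the method used by ORF Finder: it assumes the standard start codon ATG and the standard stop codons, ignores ambiguous bases, and picks the frame containing the longest start-to-stop stretch. All function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def six_frames(seq):
    """Yield (label, subsequence) for three forward and three reverse frames."""
    seq = seq.upper()
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            yield f"{strand}{offset + 1}", strand_seq[offset:]

def longest_orf(frame_seq):
    """Return the longest ATG...stop stretch (stop included) in this frame."""
    best = ""
    start = None
    for i in range(0, len(frame_seq) - 2, 3):
        codon = frame_seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            candidate = frame_seq[start:i + 3]
            if len(candidate) > len(best):
                best = candidate
            start = None
    return best

def best_frame(seq):
    """Return (frame label, ORF) for the frame with the longest ORF."""
    candidates = ((label, longest_orf(s)) for label, s in six_frames(seq))
    return max(candidates, key=lambda pair: len(pair[1]))

# Made-up example: the only complete ORF lies in forward frame 3.
print(best_frame("CCATGAAATTTGGGTAACC"))  # ('+3', 'ATGAAATTTGGGTAA')
```

The longest ORF is only a heuristic for the "most likely" frame; short genuine coding regions, or sequences containing introns, can defeat it.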

Orthologue

Genes or proteins in different organisms that derive from a common ancestral gene, and that usually have the same function, are said to be orthologous. Numerous genes have been conserved through evolutionary history. The protein products can be identified in yeast, a nematode worm and human cells, for example. It can be useful to study a gene's function in a worm if you know that it has the same function in humans.

PDB

Brookhaven Protein Data Bank. A database, and a file format, describing the 3D structure of a protein or nucleic acid, as determined by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy. The molecules described by the files are usually viewed locally by dedicated software, but can sometimes be visualised on the World Wide Web.

PIR

Protein Information Resource. A database of translated GenBank nucleotide sequences. PIR is a redundant (see Redundancy) protein sequence database. The database is divided into four categories:

PIR1 - Classified and annotated.
PIR2 - Annotated.
PIR3 - Unverified.
PIR4 - Unencoded or untranslated.

Redundancy

The presence of more than one identical item represents redundancy. In bioinformatics, the term is used with reference to the sequences in a sequence database. If a database is described as being redundant, more than one identical (redundant) sequence may be found. If the database is said to be non-redundant (nr), the database managers have attempted to reduce the redundancy.
The term is ambiguous with reference to genetics, and as such, the degree of non-redundancy varies according to the database manager's interpretation of the term. One can argue whether or not two alleles of a locus define the limit of redundancy, or whether the same locus in different, closely related organisms constitutes redundancy. Non-redundant databases are, in some ways, superior, but are less complete. These factors should be taken into consideration when selecting a database to search.
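
The strictest interpretation, in which only identical sequences count as redundant, can be sketched in Python as follows. The function name and the (name, sequence) record layout are assumptions made for this example; real non-redundant databases apply their own, often looser, criteria.

```python
def remove_exact_duplicates(records):
    """Keep the first record for each distinct sequence (case-insensitive).

    `records` is a list of (name, sequence) pairs.
    """
    seen = set()
    unique = []
    for name, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((name, seq))
    return unique

# Made-up entries: "b" repeats the sequence of "a" in lower case.
entries = [("a", "ACGT"), ("b", "acgt"), ("c", "ACGA")]
print(remove_exact_duplicates(entries))  # [('a', 'ACGT'), ('c', 'ACGA')]
```

Note that "a" and "c" differ by a single base and are both kept, which illustrates why the degree of non-redundancy depends on where the database manager draws the line.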

Sequence Tagged Site

Short DNA sequences from regions that have been physically mapped. STSs provide unique landmarks, or identifiers, throughout the genome. Useful as a framework for further sequencing.

STS

See Sequence Tagged Site

SWISS-PROT

A non-redundant (see Redundancy) protein sequence database. Thoroughly annotated and cross-referenced. A subdivision is TrEMBL.

TrEMBL

A protein sequence database of Translated EMBL nucleotide sequences.

UniGene

Database of unique human genes, at NCBI. Entries are selected by near-identical presence in the GenBank and dbEST databases. Each cluster of sequences produced is considered to represent a single gene.

Upstream

Toward the 5' end of a nucleotide sequence.

Author: Andrew Louka