Glossary
This page provides a brief glossary of terms and abbreviations commonly
encountered in bioinformatics. This glossary forms part of an online Guide to Molecular Sequence Analysis.
Some of these explanations are rather simplistic, in favour of brevity. Please refer to molecular biology
text books for more comprehensive details.
- Alu
- A family of approx. 300 bp repetitive sequences, found
dispersed throughout the human genome. Almost any 100 kb human
nucleotide sequence will have Alu sequences within it.
- Base Analogue
- A chemical compound which is sufficiently similar to
one of the nitrogenous bases normally found in DNA, that it can replace it.
Base analogues may cause mutations, or be used in a modified PCR reaction
(e.g. when sequencing).
- Bioinformatics
- The discipline of obtaining information about
genomic or protein sequence data. This may involve similarity searches of
databases, comparing your unidentified sequence to the sequences in a
database, or making predictions about the sequence based on current
knowledge of similar sequences. Databases are frequently made publicly
available through the Internet, or locally at your institution.
- BLAST
- A set of programs, used to perform fast similarity searches.
Nucleotide sequences can be compared with nucleotide sequences in a
database using BLASTN, for example. Complex statistics are applied to
judge the significance of each match. Reported sequences may be homologous
to, or related to, the query sequence. The BLASTP program is used to search
a protein database for a match against a query protein sequence. There are
several other flavours of BLAST.
- BLAST2
- A newer release of BLAST. Allows for insertions or
deletions in the sequences being aligned. Gapped alignments may be more
biologically significant.
- cDNA
- Complementary DNA. DNA copies of the mRNA expressed in a
specified tissue. cDNA sequencing has the advantage of only representing
expressed genes. Since only ~3% of the vast quantity of DNA in the human
genome is coding sequence, cDNA sequencing is particularly useful in
certain situations. See EST.
- CDS or cds
- Coding sequence.
- Clone
- Population of identical cells or molecules (e.g. DNA),
derived from a single ancestor.
- Cloning Vector
- A molecule that carries a foreign gene into a host,
and allows/facilitates the multiplication of that gene in a host. When
sequencing a gene that has been cloned using a cloning vector (rather than
by PCR), care should be taken not to include the cloning vector sequence
when performing similarity searches. Plasmids, cosmids, phagemids, YACs
and PACs are example types of cloning vectors.
- Consensus Sequence
- A derived nucleotide sequence that represents a
family of similar sequences. Each base in the consensus sequence
corresponds to the base most frequently occurring at that position in the
aligned real sequences.
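The "most frequent base at each position" rule can be sketched in a few lines of Python. This is only an illustrative sketch: the function name is hypothetical, the input is assumed to be already aligned (all sequences the same length), and ties are broken alphabetically, which is an arbitrary choice made here.

```python
from collections import Counter


def consensus(sequences):
    """Return the most frequent base at each position of aligned sequences."""
    if not sequences:
        return ""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    result = []
    # zip(*sequences) yields one column of the alignment at a time.
    for column in zip(*sequences):
        counts = Counter(column)
        # Highest count wins; ties fall back to alphabetical order
        # (an assumption of this sketch, not a biological rule).
        best = min(counts, key=lambda base: (-counts[base], base))
        result.append(best)
    return "".join(result)


aligned = ["ACGTAC", "ACGTTC", "ACCTAC", "TCGTAC"]
print(consensus(aligned))  # ACGTAC
```

Real consensus tools usually also report ambiguity codes or per-position frequencies, which this sketch leaves out.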
- Contig
- A contiguous stretch of DNA sequence, built up from shorter sequences
that overlap one another. The full set of overlapping sequences can be put
together to obtain the sequence for a long region of DNA that cannot be
sequenced in one run in a sequencing assay. Important in genetic mapping at
the molecular level.
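As a rough illustration of how two overlapping sequences can be joined, the Python sketch below merges a pair of reads when the end of one exactly matches the start of the other. The function name and the `min_overlap` parameter are hypothetical. Real assembly programs also tolerate sequencing errors and consider both strands, which this sketch does not.

```python
def merge_overlap(left, right, min_overlap=3):
    """Join two reads if a suffix of `left` equals a prefix of `right`.

    Returns the merged sequence, or None if no exact overlap of at
    least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first, then shorter ones.
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return None


print(merge_overlap("ATGCGTACGT", "TACGTTAGC"))  # ATGCGTACGTTAGC
print(merge_overlap("AAAA", "CCCC"))             # None
```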
- DNA Sequencing
- The experimental process of determining the
nucleotide sequence of a region of DNA. This is done by labelling each
nucleotide (A, C, G or T) with either a radioactive or fluorescent marker
which identifies it. There are several methods of applying this
technology, each with its own advantages and disadvantages. For more
information, refer to a current text book. High throughput laboratories
frequently use automated sequencers, which are capable of rapidly reading
large numbers of templates. Sometimes, the sequences may be generated more
quickly than they can be characterised.
- Downstream
- Toward the 3' end of a nucleotide sequence.
- EMBL
- European Molecular Biology Laboratory. Maintains the EMBL
database, one of the major public sequence databases.
- EMBnet
- European Molecular Biology Network (http://www.embnet.org).
Established in 1988, it provides services including local molecular
databases and software for molecular biologists in Europe. There are
several large outposts of EMBnet, including ExPASy.
- EST
- See Expressed Sequence Tag
- Exon
- Region of a gene that is retained in the mature mRNA after splicing.
Most exons contain coding sequence. See CDS.
- Expressed Sequence Tag (EST)
- Randomly selected, partial cDNA
sequence; represents its corresponding mRNA. dbEST is a large database of
ESTs at GenBank, NCBI.
- HGMP
- Human Genome Mapping Project.
The UK HGMP Resource Centre is an academic institution in the UK which
provides a number of services, including access to databases, mirrors of
databases, and access to extensive services/software for registered
academic users.
- Intron
- Non-coding region within a gene, removed from the transcript by
splicing.
- MMDB
- Molecular Modelling Database. A taxonomy assigned database of
PDB (see PDB) files, and related information.
- NCBI
- National Center for Biotechnology Information (USA). Created
by the United States Congress in 1988, to develop information systems to
support the biological research community.
- NIH
- National Institutes of Health (USA).
- OMIM
- Online Mendelian Inheritance in Man. Database of genetic
diseases with references to molecular medicine, cell biology, biochemistry
and clinical details of the diseases.
- ORF
- Open Reading Frame. A series of codons (base triplets), uninterrupted
by a stop codon, which can be translated into a protein. There are six
potential reading frames of an unidentified nucleotide sequence (three on
each strand). BLASTX (see BLAST) translates a nucleotide query sequence in
all six reading frames, then attempts to align the resulting protein
sequences to sequences in a protein database. TBLASTN does the converse:
it compares a protein query sequence against a nucleotide database that
has been translated in all six reading frames. The most likely reading
frame can be identified using on-line software (e.g. ORF Finder).
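The idea of six reading frames can be made concrete with a short Python sketch. It translates a nucleotide sequence in the three forward frames and the three frames of the reverse complement, using the standard genetic code. The function names and frame labels ("+1" to "-3") are choices made for this illustration, not the output format of any particular program. Unrecognised codons are shown as "X".

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; '*' marks a stop codon."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(seq):
    """Return a dict mapping frame label to its protein translation."""
    seq = seq.upper()
    frames = {}
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(template[offset:])
    return frames


for label, protein in six_frame_translations("ATGGCCTAA").items():
    print(label, protein)
```

One simple heuristic for picking the most likely frame is to look for the longest stretch without a stop codon, though dedicated software applies further criteria.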
- Orthologue
- Genes or proteins from different organisms that are derived from a
common ancestral gene, and that typically have the same function, are said
to be orthologous. There are numerous genes that have been conserved
through evolutionary history. The protein products can be identified in
yeast, a nematode worm and human cells, for example. It can be interesting
to study the gene function in a worm, if you know that it has the same
function in humans.
- PDB
- Brookhaven Protein Data Bank. A database and format of files
which describe the 3D structure of a protein or nucleic acid, as determined
by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy. The
molecules described by the files are usually viewed locally by dedicated
software, but can sometimes be visualised on the world wide web.
- PIR
- A database of translated GenBank nucleotide sequences. PIR is a
redundant (see Redundancy) protein sequence database. The database is
divided into four categories:
- PIR1 - Classified and annotated.
- PIR2 - Annotated.
- PIR3 - Unverified.
- PIR4 - Unencoded or untranslated.
- Redundancy
- The presence of more than one identical item represents
redundancy. In bioinformatics, the term is used with reference to the
sequences in a sequence database. If a database is described as being
redundant, more than one identical (redundant) sequence may be
found. If the database is said to be non-redundant (nr), the
database managers have attempted to reduce the redundancy.
The term is ambiguous with reference to genetics, and as such, the degree
of non-redundancy varies according to the database manager's interpretation
of the term. One can argue whether or not two alleles of a locus define
the limit of redundancy, or whether the same locus in different, closely
related organisms constitutes redundancy. Non-redundant databases are, in
some ways, superior, but are less complete. These factors should be taken
into consideration when selecting a database to search.
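Under the simplest interpretation of redundancy (exactly identical sequences), reducing redundancy amounts to keeping one record per distinct sequence. The Python sketch below shows this. The function name and the (identifier, sequence) record layout are hypothetical, and real database managers apply more elaborate criteria, as discussed above.

```python
def remove_identical(records):
    """Keep the first record for each distinct sequence.

    `records` is an iterable of (identifier, sequence) pairs.
    Sequences are compared case-insensitively and returned upper-cased.
    """
    first_seen = {}
    for identifier, sequence in records:
        key = sequence.upper()
        # setdefault stores the identifier only if the key is new.
        first_seen.setdefault(key, identifier)
    # Dicts preserve insertion order, so the original order is kept.
    return [(identifier, sequence) for sequence, identifier in first_seen.items()]


records = [("seq1", "ACGT"), ("seq2", "acgt"), ("seq3", "ACGA")]
print(remove_identical(records))  # [('seq1', 'ACGT'), ('seq3', 'ACGA')]
```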
- Sequence Tagged Site
- Short DNA sequences of regions that have been
physically mapped. STSs provide unique landmarks, or identifiers,
throughout the genome. Useful as a framework for further sequencing.
- STS
- See Sequence Tagged Site
- SWISS-PROT
- A non-redundant (see Redundancy) protein sequence
database. Thoroughly annotated and cross-referenced. It is supplemented by
TrEMBL (see TrEMBL).
- TrEMBL
- A protein sequence database of Translated EMBL nucleotide
sequences.
- UniGene
- Database of unique human genes, at NCBI. Nearly identical
sequences from the GenBank and dbEST databases are grouped into clusters.
Each cluster of sequences is considered to represent a single gene.
- Upstream
- Toward the 5' end of a nucleotide sequence.
Author: Andrew Louka