Glossary
This page provides a brief glossary of terms and abbreviations commonly 
encountered in bioinformatics.  This glossary forms part of an online Guide to Molecular Sequence Analysis.
Some of these explanations are simplified for the sake of brevity.  Please refer to molecular biology 
textbooks for more comprehensive details.
	
	- Alu
- A family of repetitive sequences, approx. 300 bp long, found 
	dispersed throughout the human genome.  Almost any 100 kb human 
	nucleotide sequence will have Alu sequences within it.
	
	
	 
- Base Analogue 
- A chemical compound which is sufficiently similar to 
	one of the nitrogenous bases normally found in DNA that it can replace that base.  
	Base analogues may cause mutations, or be used in a modified PCR reaction 
	(e.g. when sequencing).
	
	
	 
- Bioinformatics 
- The discipline of obtaining information about 
	genomic or protein sequence data.  This may involve similarity searches of 
	databases, comparing your unidentified sequence to the sequences in a 
	database, or making predictions about the sequence based on current 
	knowledge of similar sequences.  Databases are frequently made publicly 
	available through the Internet, or locally at your institution.
	
	
	 
- BLAST 
- A set of programs, used to perform fast similarity searches. 
	Nucleotide sequences can be compared with nucleotide sequences in a 
	database using BLASTN, for example.  Complex statistics are applied to 
	judge the significance of each match.  Reported sequences may be homologous 
	to, or related to the query sequence. The BLASTP program is used to search 
	a protein database for a match against a query protein sequence.  There are 
	several other flavours of BLAST.
	
	
	 
- BLAST2 
- A newer release of BLAST.  Allows for insertions or 
	deletions in the sequences being aligned.  Gapped alignments may be more 
	biologically significant.
	
	
	 
- cDNA 
- Complementary DNA.  DNA copies of the mRNA expressed in a 
	specified tissue.  cDNA sequencing has the advantage of only representing 
	expressed genes.  Since only ~3% of the vast quantity of DNA in the human 
	genome is coding sequence, cDNA sequencing is particularly useful in 
	certain situations.  See EST.
	
	
	 
- CDS or cds 
- Coding sequence.
	
	
	 
- Clone 
- Population of identical cells or molecules (e.g.  DNA), 
	derived from a single ancestor.
	
	
	 
- Cloning Vector 
- A molecule that carries a foreign gene into a host, 
	and allows/facilitates the multiplication of that gene in a host.  When 
	sequencing a gene that has been cloned using a cloning vector (rather than 
	by PCR), care should be taken not to include the cloning vector sequence 
	when performing similarity searches.  Plasmids, cosmids, phagemids, YACs 
	and PACs are example types of cloning vectors.
	
	
	 
- Consensus Sequence 
- A derived nucleotide sequence that represents a 
	family of similar sequences. Each base in the consensus sequence 
	corresponds to the base most frequently occurring at that position in the 
	real sequences.
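
	The rule above (take the most frequent base at each position) can be sketched in a few lines of Python. This is only an illustration: it assumes the sequences are already aligned and of equal length, and the function name `consensus` and the alphabetical tie-break are choices made for this sketch, not part of any standard tool.

```python
from collections import Counter


def consensus(sequences):
    """Return the most frequent base at each position of aligned sequences."""
    if not sequences:
        return ""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    result = []
    for column in zip(*sequences):
        counts = Counter(column)
        # Assumption for this sketch: ties are broken alphabetically.
        best = min(counts, key=lambda base: (-counts[base], base))
        result.append(best)
    return "".join(result)


aligned = ["ATGCA", "ATGGA", "ACGCA", "ATGCT"]
print(consensus(aligned))  # ATGCA
```

	Real consensus builders often report ambiguity codes or a frequency threshold instead of a single winning base; this sketch does neither.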
	
	
	
	 
- Contig 
- A DNA sequence that overlaps with another contig.  The full 
	set of overlapping sequences (contigs) can be put together to obtain the 
	sequence for a long region of DNA that cannot be sequenced in one run in a 
	sequencing assay.  Important in genetic mapping at the molecular level.
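
	As a rough illustration of putting overlapping sequences together, the Python sketch below joins fragments wherever the end of one exactly matches the start of the next. The function name `merge_overlapping`, the `min_overlap` parameter and the example fragments are hypothetical; real assembly software also has to cope with sequencing errors, unknown fragment order and both strands.

```python
def merge_overlapping(left, right, min_overlap=3):
    """Join two sequences where a suffix of `left` equals a prefix of `right`.

    Returns the merged sequence, or None if no overlap of at least
    `min_overlap` bases is found. Only exact matches are considered.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first.
    for size in range(max_len, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return None


# Hypothetical fragments, already in the correct order.
fragments = ["ATGGCGTAC", "GTACCTTAG", "TTAGGACA"]
assembled = fragments[0]
for fragment in fragments[1:]:
    merged = merge_overlapping(assembled, fragment)
    if merged is None:
        raise ValueError("fragments do not overlap")
    assembled = merged
print(assembled)  # ATGGCGTACCTTAGGACA
```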
	
	
	 
- DNA Sequencing 
- The experimental process of determining the 
	nucleotide sequence of a region of DNA.  This is done by labelling each 
	nucleotide (A, C, G or T) with either a radioactive or fluorescent marker 
	which identifies it.  There are several methods of applying this 
	technology, each with its own advantages and disadvantages.  For more 
	information, refer to a current textbook.  High throughput laboratories 
	frequently use automated sequencers, which are capable of rapidly reading 
	large numbers of templates.  Sometimes, the sequences may be generated more 
	quickly than they can be characterised.
	
	
	 
- Downstream 
- Toward the 3' end of a nucleotide sequence.
	
	
	 
- EMBL 
- European Molecular Biology Laboratory.  Maintains the EMBL 
	database, one of the major public sequence databases.
	
	
	 
- EMBnet 
- European Molecular Biology Network (http://www.embnet.org).  
	Established in 1988, it provides services including local molecular 
	databases and software for molecular biologists in Europe.  There are 
	several large outposts of EMBnet, including ExPASy.
	
- EST 
- See Expressed Sequence Tag
	
	 
- Exon 
- Coding region of DNA.  See CDS.
	
	
	 
- Expressed Sequence Tag (EST) 
- Randomly selected, partial cDNA 
	sequence; represents its corresponding mRNA.  dbEST is a large database of 
	ESTs at GenBank, NCBI.
	
	
	 
- HGMP 
- Human Genome Mapping Project. 
 The UK HGMP Resource Centre is an academic institution in the UK which 
	provides a number of services, including access to databases, mirrors of 
	databases, and access to extensive services/software for registered 
	academic users.
	 
- Intron 
- Non-coding region of DNA.
	
	
	 
- MMDB 
- Molecular Modelling Database.  A taxonomy assigned database of 
	PDB (see PDB) files, and related information.
	
	
	 
- NCBI 
- National Center for Biotechnology Information (USA).  Created 
	by the United States Congress in 1988, to develop information systems to 
	support the biological research community.
	
	
	 
- NIH 
- National Institutes of Health (USA).
	
	
	 
- OMIM 
- Online Mendelian Inheritance in Man.  Database of genetic 
	diseases with references to molecular medicine, cell biology, biochemistry 
	and clinical details of the diseases.
	
	
	 
- ORF 
- Open Reading Frame.  A series of codons (base triplets) which 
	can be translated into a protein.  There are six potential reading frames 
	of an unidentified sequence: three on each strand.  BLASTX (see BLAST) translates a nucleotide 
	query in all six reading frames into protein, then attempts to align 
	the translations to sequences in a protein database.  (TBLASTN does the reverse: it compares a protein 
	query against a nucleotide database translated in all six frames.)  The most likely reading frame can be identified using 
	on-line software (e.g. ORF Finder).
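
	The six reading frames can be enumerated directly: three start offsets on the sequence as given, and three on its reverse complement. The Python sketch below does this and reports the longest stretch running from an ATG to a stop codon. It is an illustration only, not how ORF Finder or BLAST is implemented; the frame labels (`+1` to `+3`, `-1` to `-3`), the function names and the example sequence are assumptions made here, and only the standard stop codons TAA, TAG and TGA are recognised.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Yield (label, codons) for three forward and three reverse frames."""
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Only complete triplets are kept.
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            yield f"{strand}{offset + 1}", codons


def longest_orf(seq):
    """Return (frame label, ORF) for the longest ATG...stop stretch, or None."""
    best = None
    for label, codons in six_frames(seq):
        start = None
        for index, codon in enumerate(codons):
            if codon == "ATG" and start is None:
                start = index
            elif codon in STOP_CODONS and start is not None:
                orf = "".join(codons[start:index + 1])
                if best is None or len(orf) > len(best[1]):
                    best = (label, orf)
                start = None
    return best


# Hypothetical example sequence.
print(longest_orf("CCATGGCTTGGTAAGC"))  # ('+3', 'ATGGCTTGGTAA')
```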
	
	
	
	 
- Orthologue 
- Genes or proteins from different organisms 
	that have the same function are said to be orthologous.  There are 
	numerous genes that have been conserved through evolutionary history.  The 
	protein products can be identified in yeast, a nematode worm and human 
	cells, for example.  It can be interesting to study the gene function in a 
	worm, if you know that it has the same function in humans.
	
	
	 
- PDB 
- Brookhaven Protein Data Bank.  A database and format of files 
	which describe the 3D structure of a protein or nucleic acid, as determined 
	by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy.  The 
	molecules described by the files are usually viewed locally by dedicated 
	software, but can sometimes be visualised on the world wide web.
	
	
	 
- PIR 
- A database of translated GenBank nucleotide sequences. PIR is a 
	redundant (see Redundancy) protein sequence database. The database is 
	divided into four categories:
	
		- PIR1 - Classified and annotated.
		- PIR2 - Annotated.
		- PIR3 - Unverified.
		- PIR4 - Unencoded or untranslated.
	
 
	
	 
- Redundancy 
- The presence of more than one identical item represents 
	redundancy.  In bioinformatics, the term is used with reference to the 
	sequences in a sequence database. If a database is described as being 
	redundant, more than one identical (redundant) sequence may be 
	found.  If the database is said to be non-redundant (nr), the 
	database managers have attempted to reduce the redundancy. 
 The term is ambiguous with reference to genetics, and as such, the degree 
	of non-redundancy varies according to the database manager's interpretation 
	of the term.  One can argue whether or not two alleles of a locus define 
	the limit of redundancy, or whether the same locus in different, closely 
	related organisms constitutes redundancy. Non-redundant databases are, in 
	some ways, superior, but are less complete. These factors should be taken 
	into consideration when selecting a database to search.
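
	The strictest reading of redundancy (exactly identical sequences) can be sketched as follows. This Python example is an illustration only: the function name `remove_identical`, the (name, sequence) record layout and the entry names are hypothetical, and it says nothing about how any particular database is actually curated.

```python
def remove_identical(records):
    """Keep the first record for each distinct sequence.

    `records` is a list of (name, sequence) pairs. Sequences are compared
    case-insensitively; anything short of exact identity is kept.
    """
    seen = set()
    unique = []
    for name, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            unique.append((name, sequence))
    return unique


# Hypothetical entries: seq1 and seq3 are identical apart from case.
entries = [
    ("seq1", "ATGGCGTAA"),
    ("seq2", "ATGGCCTAA"),
    ("seq3", "atggcgtaa"),
]
print(remove_identical(entries))
# [('seq1', 'ATGGCGTAA'), ('seq2', 'ATGGCCTAA')]
```

	Note that seq2, which differs from seq1 by a single base, is kept. Whether such near-identical entries (alleles, for instance) count as redundant is exactly the judgement that varies between databases.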
	 
- Sequence Tagged Site 
- Short DNA sequences of regions that have been 
	physically mapped.  STSs provide unique landmarks, or identifiers, 
	throughout the genome.  Useful as a framework for further sequencing.
	
	
	 
- STS 
- See Sequence Tagged Site
	
	
	 
- SWISS-PROT 
- A non-redundant (See Redundancy) protein sequence 
	database. Thoroughly annotated and cross referenced.  A subdivision is 
	TrEMBL.
	
	
	 
- TrEMBL 
- A protein sequence database of Translated EMBL nucleotide 
	sequences.
	
	
	 
- UniGene 
- Database of unique human genes, at NCBI.  Entries are 
	selected by near identical presence in GenBank and dbEST databases.  The 
	clusters of sequences produced are considered to represent a single gene.
	
	
	 
- Upstream 
- Toward the 5' end of a nucleotide sequence.
Author: Andrew Louka