bioinf_org

Bioinformatics Frequently Asked Questions

Mail your questions to me, Damian Counsell, and I'll try to bring you answers. Alternatively mail your answers and I'll incorporate them.

Please note that I cannot answer individual specific queries---I am not a careers adviser. I am, however, happy to tackle questions of general interest to all visitors to the site.

I consider bioinformatics to be a special kind of engineering discipline---it certainly isn't a "pure" science. It has been enormously successful in its short existence and I think its successes have been the result of a practical and rigorous approach which I hope to encourage in anyone interested in entering the field.

This document is not a scientific paper or textbook (yet). You will find blunt opinions here. If you disagree with me about any of the following please tell me. I hope to learn a lot from your inevitable and welcome criticisms.

There is certainly one sense in which I consider myself a pure scientist: I'm open to rational persuasion.

Overview
Contents
Definitions

Definition of Bioinformatics

What is Bioinformatics?---The Tight Definition
What is Bioinformatics?---The Loose Definition

Definitions of Fields Related to Bioinformatics

What is Computational Biology?
What is Medical informatics/Medinformatics?
What is Cheminformatics?
What is Genomics?
What is Proteomics?
What is Pharmacogenomics?

How old is the discipline?

Resources

Can you recommend any bioinformatics books?

General introductions
Computational/Mathematical aspects of bioinformatics
Applying bioinformatics in biological research
Other lists of bioinformatics books

What bioinformatics sites are there?

Tutorials
Societies
Collections of Tools
Portals

Education

Where can I study bioinformatics?

...in Africa
...in the Americas
...in Asia
...in Australasia
...in Europe

...in the UK

Careers

How can I get involved?

I am a newbie
I am a biologist
I am a computer scientist
More general advice

Where can I find bioinformatics jobs?

Practical Tips

How can I find a sequence?

...I have a description.
...I have an accession number.
...I have another sequence.
...I'm not sure whether to use the defaults.

How can I align two sequences?
How can I predict the function of a gene (product)?
How can I predict the structure of a sequence?
How can I write up?

Glossary of bioinformatics terms

What is an alignment?
What is a DNA array?
What is a homologue?
What is a scoring matrix?

Acknowledgements

Questions
Links
Answers

Small Print

Author and licensing
Version control information

Definitions

Definition of Bioinformatics

Roughly, bioinformatics describes any use of computers to handle biological information. In practice the definition used by most people is narrower; bioinformatics to them is a synonym for "computational molecular biology"---the use of computers to characterise the molecular components of living things.

What is Bioinformatics?---The Tight Definition

"Classical" bioinformatics

Fredj Tekaia at the Institut Pasteur offers this definition of bioinformatics:

"The mathematical, statistical and computing methods that aim to solve biological problems using DNA and amino acid sequences and related information."

Most biologists talk about "doing bioinformatics" when they use computers to store, retrieve, analyse or predict the composition or the structure of biomolecules. As computers become more powerful you could probably add simulate to this list of bioinformatics verbs. "Biomolecules" include your genetic material---nucleic acids---and the products of your genes: proteins. These are the concerns of "classical" bioinformatics, dealing primarily with sequence analysis.

It is a mathematically interesting property of most large biological molecules that they are polymers; ordered chains of simpler molecular modules called monomers. Think of them as beads or building blocks which, despite having different colours and shapes, all have the same thickness and the same way of connecting to one another. Each monomer molecule is of the same general class, but each kind of monomer has its own well-defined set of characteristics. Many monomer molecules can be joined together to form a single, far larger, macromolecule which has exquisitely specific informational content and/or chemical properties.

According to this scheme, the monomers in a given macromolecule of DNA or protein can be treated computationally as letters of an alphabet, put together in pre-programmed arrangements to carry messages or do work in a cell.
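
To make the "letters of an alphabet" idea concrete, here is a minimal Python sketch that treats a DNA molecule as a plain string. The function names are made up for illustration and do not come from any particular bioinformatics package; the only biology assumed is the standard four-letter DNA alphabet and Watson-Crick base pairing (A with T, C with G).

```python
from collections import Counter

# The four-letter alphabet of DNA monomers (nucleotides).
DNA_ALPHABET = set("ACGT")

# Watson-Crick pairing used to derive the opposite strand.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def is_dna(sequence: str) -> bool:
    """Return True if every letter belongs to the DNA alphabet."""
    return set(sequence.upper()) <= DNA_ALPHABET


def composition(sequence: str) -> dict:
    """Count how often each monomer (letter) occurs."""
    return dict(Counter(sequence.upper()))


def reverse_complement(sequence: str) -> str:
    """Return the sequence of the opposite strand, read in its own direction."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence.upper()))


if __name__ == "__main__":
    seq = "GATTACA"
    print(is_dna(seq))              # validity check against the alphabet
    print(composition(seq))         # letter counts
    print(reverse_complement(seq))  # opposite strand
```

Once a molecule is a string, the whole toolkit of text processing (searching, counting, comparing) becomes available, which is why so much classical bioinformatics looks like string handling.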

"New" bioinformatics

The greatest achievement of bioinformatics methods, the Human Genome Project, is currently being completed. Because of this the nature and priorities of bioinformatics research and applications are changing. People often talk portentously of our living in the "post-genomic" era. My personal view is that this will affect bioinformatics in several ways:

Now we possess multiple whole genomes we can look for differences and similarities between all the genes of multiple species. From such studies we can draw particular conclusions about species and general ones about evolution. This kind of science is often referred to as comparative genomics.
There are now technologies designed to measure the relative number of copies of a genetic message (levels of gene expression) at different stages in development or disease or in different tissues. Technologies of this kind, such as DNA microarrays, will grow in importance.
Other, more direct, large-scale ways of identifying gene functions and associations (for example yeast two-hybrid methods) will grow in significance and with them the accompanying bioinformatics of functional genomics.
There will be a general shift in emphasis (of sequence analysis especially) from genes themselves to gene products. This will lead to:

attempts to catalogue the activities and characterize interactions between all gene products (in humans): proteomics.
attempts to crystallize and/or predict the structures of all proteins (in humans): structural genomics.
fewer DNA double-helices in bad sci-fi movies.

What some people refer to as research or medical informatics, the management of all biomedical experimental data associated with particular molecules or patients---from mass spectroscopy, to in vitro assays to clinical side-effects---will move from the concern of those working in drug company and hospital I.T. (information technology) into the mainstream of cell and molecular biology and migrate from the commercial and clinical to academic sectors.

This FAQ concentrates on classical bioinformatics, but will, I hope, grow to cover more of the "post-genomic" aspects of the field. It is worth noting that all of the above non-classical areas of research depend upon established sequence analysis techniques.

Definitions of Fields Related to Bioinformatics

What is Computational Biology?

Computational biologists might object (please do), but I find that people use "computational biology" when discussing that subset of bioinformatics (in the broadest sense) closest to the field of classical general biology.

Computational biologists interest themselves more in evolutionary, population and theoretical biology than in cell and molecular biomedicine. It is inevitable that molecular biology is profoundly important in computational biology, but it is certainly not what computational biology is all about (see next paragraph). In these areas of computational biology it seems that computational biologists have tended to prefer statistical models for biological phenomena over physico-chemical ones. This is often wise...

One computational biologist (Paul J Schulte) did object to the above and makes the entirely valid point that this definition derives from a popular use of the term, rather than a correct one. Paul works on water flow in plant cells. He points out that biological fluid dynamics is a field of computational biology in itself. He argues that this, and any application of computing to biology, can be described as "computational biology" (see also the "loose" definition of bioinformatics below). Where we disagree, perhaps, is in the conclusion he draws from this---which I reproduce in full:

"Computational biology is not a "field", but an "approach" involving the use of computers to study biological processes and hence it is an area as diverse as biology itself."

Richard Durbin, Head of Informatics at the Wellcome Trust Sanger Institute, expressed an interesting opinion on this distinction in an interview:

"I do not think all biological computing is bioinformatics, e.g. mathematical modelling is not bioinformatics, even when connected with biology-related problems. In my opinion, bioinformatics has to do with management and the subsequent use of biological information, particular genetic information."

What is Medical Informatics?

The Medical Informatics FAQ (no relation) provides the following definition:

"Biomedical Informatics is an emerging discipline that has been defined as the study, invention, and implementation of structures and algorithms to improve communication, understanding and management of medical information."

Aamir Zakaria, the author of the FAQ, emphasises that medical informatics is more concerned with structures and algorithms for the manipulation of medical data, rather than with the data itself.

This suggests that one difference between bioinformatics and medical informatics as disciplines lies with their approaches to the data; there are bioinformaticists interested in the theory behind the manipulation of that data and there are bioinformatics scientists concerned with the data itself and its biological implications. (I believe that a good bioinformatics researcher should be interested in both of these aspects of the field.)

Medical informatics, for practical reasons, is more likely to deal with data obtained at "grosser" biological levels---that is information from super-cellular systems, right up to the population level---while most bioinformatics is concerned with information about cellular and biomolecular structures and systems.

On both of these points I'd be happy for any medical informatics specialists to correct me.

What is Cheminformatics?

The Web advertisement for Cambridge Healthtech Institute's Sixth Annual Cheminformatics conference describes the field thus:

"the combination of chemical synthesis, biological screening, and data-mining approaches used to guide drug discovery and development"

but this, again, sounds more like a field being identified by some of its most popular (and lucrative) activities, rather than by including all the diverse studies that come under its general heading.

The story of one of the most successful drugs of all time, penicillin, seems bizarre, but the way we discover and develop drugs even now has similarities, being the result of chance, observation and a lot of slow, intensive chemistry. Until recently, drug design always seemed doomed to continue to be a labour-intensive, trial-and-error process. The possibility of using information technology to plan intelligently and to automate processes related to the chemical synthesis of possible therapeutic compounds is very exciting for chemists and biochemists. The rewards for bringing a drug to market more rapidly are huge, so naturally this is what a lot of cheminformatics work is about.

The span of academic cheminformatics is wide and is exemplified by the interests of the cheminformatics groups at the Centre for Molecular and Biomolecular Informatics at the University of Nijmegen in the Netherlands. These interests include:

Synthesis Planning
Reaction and Structure Retrieval
3-D Structure Retrieval
Modelling
Computational Chemistry
Visualisation Tools and Utilities

Trinity University's Cheminformatics Web page, for another example, concerns itself with cheminformatics as the use of the Internet in chemistry.

What is Genomics?

[XXXX INSERT DEFINITION OF GENOMICS HERE]

What is Proteomics?

[XXXX INSERT DEFINITION PROTEOMICS HERE]

What is Pharmacogenomics?

[XXXX INSERT DEFINITION PHARMACOGENOMICS HERE]

Overview of most common bioinformatics programs

Everyday bioinformatics is done with sequence search programs like BLAST; sequence analysis programs like the EMBOSS and Staden packages; structure prediction programs like THREADER or PHD; and molecular imaging/modelling programs like RasMol and WHATIF.

Overview of most common bioinformatics technology

Currently, a lot of bioinformatics work is concerned with the technology of databases. These databases include both "public" repositories of gene data like GenBank or the Protein Data Bank (the PDB), and private databases like those used by research groups involved in gene mapping projects or those held by biotech companies. Making such databases accessible via open standards like the Web is very important since consumers of bioinformatics data use a range of computer platforms: from the more powerful and forbidding UNIX boxes favoured by the developers and curators to the far friendlier Macs often found populating the labs of computer-wary biologists.

Databases of existing sequencing data can be used to identify homologues of new molecules that have been amplified and sequenced in the lab. The property of sharing a common ancestor, homology, can be a very powerful indicator in bioinformatics (see below).

Acquisition of sequence data

Bioinformatics tools can be used to obtain sequences of genes or proteins of interest, either from material obtained, labelled, prepared and examined in electric fields by individual researchers/groups or from repositories of sequences from previously investigated material.

Analysis of data

Both types of sequence can then be analysed in many ways with bioinformatics tools.

They can be assembled. Note that this is one of the occasions when the meaning of a biological term differs markedly from a computational one (see the amusing confusion over the issue at Web-based geek forum Slashdot). Computer scientists, banish from your mind any thought of assembly language. Sequencing can only be performed for relatively short stretches of a biomolecule and finished sequences are therefore prepared by arranging overlapping "reads" of monomers (single beads on a molecular chain) into a single continuous passage of "code". This is the bioinformatic sense of assembly.
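
As a toy illustration of assembly in this sense, the sketch below merges overlapping reads greedily: it repeatedly joins the two reads that share the longest suffix-to-prefix overlap. This is a deliberately naive approach chosen for clarity, not the method used by any real assembler, and it assumes error-free reads all taken from the same strand. The function names and the `min_length` parameter are invented for this example.

```python
def overlap(left: str, right: str, min_length: int = 3) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`.

    Overlaps shorter than `min_length` are ignored and reported as 0.
    """
    longest_possible = min(len(left), len(right))
    for length in range(longest_possible, min_length - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(reads, min_length: int = 3):
    """Merge reads pairwise, best overlap first, until no overlaps remain.

    Returns a list of contigs (one contig if everything could be joined).
    """
    reads = list(reads)
    while len(reads) > 1:
        best_length, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i == j:
                    continue
                length = overlap(a, b, min_length)
                if length > best_length:
                    best_length, best_i, best_j = length, i, j
        if best_length == 0:
            break  # nothing left to join
        merged = reads[best_i] + reads[best_j][best_length:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    # Three overlapping "reads" of the stretch ATGGCGTACGTTA.
    print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGTTA"]))
```

Real sequence data contain errors, repeats and reads from both strands, which is why production assemblers are far more elaborate than this.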

They can be mapped (see note)---that is, their sequences can be parsed to find sites where so-called "restriction enzymes" will cut them.
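
A minimal sketch of this kind of mapping follows. It simply scans a DNA string for the recognition sequences of three well-known restriction enzymes and reports where each site starts (counting from 0). The function names are made up for illustration, only one strand is scanned (adequate here because these particular sites read the same on both strands), and the exact position of the cut within each site is not modelled.

```python
# Recognition sequences of three commonly used restriction enzymes.
RECOGNITION_SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}


def find_sites(sequence: str, site: str) -> list:
    """Return the 0-based start position of every occurrence of `site`."""
    sequence = sequence.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        # Step forward by one so that overlapping sites are also found.
        start = sequence.find(site, start + 1)
    return positions


def restriction_map(sequence: str, enzymes: dict = RECOGNITION_SITES) -> dict:
    """Map each enzyme name to the list of positions where its site occurs."""
    return {name: find_sites(sequence, site) for name, site in enzymes.items()}


if __name__ == "__main__":
    dna = "TTGAATTCAAGGATCCTTGAATTC"
    for enzyme, positions in restriction_map(dna).items():
        print(enzyme, positions)
```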

They can be compared, usually by aligning corresponding segments and looking for matching and mismatching letters in their sequences. Genes or proteins which are sufficiently similar are likely to be related and are therefore said to be "homologous" to each other---the whole truth is rather more complicated than this. Such cousins are called "homologues".
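
The simplest form of such a comparison can be sketched as follows, assuming the two sequences have already been aligned (so they have the same length, with "-" marking a gap). The code only counts matching and mismatching letters and reports a percentage identity; it does not compute the alignment itself, and the function names are invented for this example. Note also that the choice to leave gap columns out of the percentage is just one of several conventions in use.

```python
def compare_aligned(first: str, second: str):
    """Count matches, mismatches and gap columns in two aligned sequences."""
    if len(first) != len(second):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(first.upper(), second.upper()):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


def percent_identity(first: str, second: str) -> float:
    """Percentage of non-gap columns in which the two letters agree."""
    matches, mismatches, _ = compare_aligned(first, second)
    compared = matches + mismatches
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    print(compare_aligned("GATTACA", "GACT-CA"))
    print(round(percent_identity("GATTACA", "GACT-CA"), 1))
```

How high the identity has to be before two sequences count as "sufficiently similar" depends on their length and type, which is part of why the whole truth is more complicated.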

If a homologue (a related molecule) exists then a newly discovered protein may be modelled---that is, the three-dimensional structure of the gene product can be predicted without doing laboratory experiments.

Bioinformatics is used in primer design. Primers are short sequences needed to make many copies of (amplify) a piece of DNA as used in PCR (the Polymerase Chain Reaction).
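
Two of the simplest quantities a primer design tool reports can be sketched in a few lines. The FAQ does not prescribe any particular method; the example below assumes the widely quoted "Wallace rule" of thumb for short primers (2 degrees Celsius for each A or T, 4 for each G or C) as a crude estimate of melting temperature, together with the fraction of G and C letters. The function names are hypothetical, and real design programs use considerably more refined models.

```python
def gc_content(primer: str) -> float:
    """Fraction of the primer's letters that are G or C."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> int:
    """Rough melting temperature (degrees Celsius) by the Wallace rule.

    Assumption: Tm = 2 * (A + T) + 4 * (G + C), a rule of thumb
    intended only for short primers.
    """
    primer = primer.upper()
    at_count = primer.count("A") + primer.count("T")
    gc_count = primer.count("G") + primer.count("C")
    return 2 * at_count + 4 * gc_count


if __name__ == "__main__":
    candidate = "ATGCGTACGTTAGC"
    print(gc_content(candidate))
    print(wallace_tm(candidate))
```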

Bioinformatics is used to attempt to predict the function of actual gene products.

Information about the similarity, and, by implication, the relatedness of proteins is used to trace the "family trees" of different molecules through evolutionary time.

There are various other applications of computer analysis to sequence data, but, with so much raw data being generated by the Human Genome Project and other initiatives in biology, computers are presently essential for many biologists just to manage their day-to-day results.

Molecular modelling / structural biology is a growing field which can be considered part of bioinformatics. There are, for example, tools which allow you (often via the Net) to make pretty good predictions of the secondary structure of proteins arising from a given amino acid sequence, often based on known "solved" structures and other sequenced molecules acquired by structural biologists.

Structural biologists use "bioinformatics" to handle the vast and complex data from X-ray crystallography, nuclear magnetic resonance (NMR) and electron microscopy investigations and create the 3-D models of molecules that seem to be everywhere in the media.

note

Unfortunately the word "map" is used in several different ways in biology/genetics/bioinformatics. The definition given above is the one most frequently used in this context, but a gene can be said to be "mapped" when its parent chromosome has been identified, when its physical or genetic distance from other genes is established and---less frequently---when the structure and locations of its various coding components (its "exons") are established.

What is Bioinformatics?---The Loose Definition

There are other fields---for example medical imaging / image analysis which might be considered part of bioinformatics. There is also a whole other discipline of biologically-inspired computation: genetic algorithms, AI, neural networks. Often these areas interact in strange ways. Neural networks, inspired by crude models of the functioning of nerve cells in the brain, are used in a program called PHD to predict, surprisingly accurately, the secondary structures of proteins from their primary sequences.

What almost all bioinformatics has in common is the processing of large amounts of biologically-derived information, whether DNA sequences or breast X-rays.

How old is the discipline?

"How old is bioinformatics?" The answer to this one depends on which source you choose to read.

From T K Attwood and D J Parry-Smith's "Introduction to Bioinformatics", Prentice-Hall 1999 [Longman Higher Education; ISBN 0582327881]:

"The term bioinformatics is used to encompass almost all computer applications in biological sciences, but was originally coined in the mid-1980s for the analysis of biological sequence data."

From Mark S. Boguski's article in the "Trends Guide to Bioinformatics" Elsevier, Trends Supplement 1998 p1:

"The term `bioinformatics' is a relatively recent invention, not appearing in the literature until 1991 and then only in the context of the emergence of electronic publishing...
"...However, some of my role models when I was a graduate student (Margaret O. Dayhoff, Russell F. Doolittle, Walter M. Fitch and Andrew D. McLachlan) had been building databases, developing algorithms and making biological discoveries by sequence analysis since the 1960s---long before anyone thought to label this activity with a special term (if anything it was called `molecular evolution'). Even a relatively new kid on the block, the National Center for Biotechnology Information (NCBI), is celebrating its 10th anniversary this year, having been written into existence by US Congressman Claude Pepper and President Ronald Reagan in 1988. So bioinformatics has, in fact, been in existence for more than 30 years and is now middle-aged."

Resources

Can you recommend any bioinformatics books?

It's notoriously difficult to find any books on bioinformatics itself that cater well for all of those coming from computing, from mathematics and from biology backgrounds. The few textbooks available in the field tend to be eyewateringly expensive as well. I've divided suggested reading into books of general interest, those best suited to people coming from a computational/mathematical background and books for biologists interested in bioinformatics. After my suggestions are some links to other lists of bioinformatics books.

General introductions

Many people are curious about the Human Genome (Project). The completion of the first draft probably represents bioinformatics' coming of age as a discipline. The first couple of books are aimed at the intelligent layperson.

A gossipy and insightful account of the race to sequence the genome can be found in "The Sequence" by Kevin Davies [Weidenfeld; ISBN 0297646982]. Matt Ridley's "Genome" [Fourth Estate; ISBN 185702835X] is both an interesting layperson's introduction to the issues raised by the bioinformatic revolution and an overview of its biology and enormous scope. If I remember rightly, Ridley's book received a slightly snooty review from Walter Bodmer. This is understandable, since his and Robin McKie's excellent "pre-genomic" guide to the Human Genome Mapping Project, "The Book of Life" [Oxford Paperbacks; ISBN 0195114876] was undeservedly in a remainders bin when I bought my copy a couple of years ago.

If you are a non-biological scientist (or a non-scientist) and are hooked by these, why not go back to the "real beginning" of the race and read James Watson's entertaining and indiscreet memoir of his and Francis Crick's determination of the structure of DNA, "The Double Helix" [Penguin; ISBN 0140268774]---now updated with an introduction by media don Steve Jones.

Nigel Barber at Peterborough Regional College in the UK recommends Gary Zweiger's "Transducing the Genome" [McGraw-Hill Professional Publishing: ISBN 0071369805]. The summary at Amazon makes it sound a tad pretentious, but all the reviews seem pretty positive so it might be worth a read.

Computational/Mathematical aspects

If you are a hardcore maths/computing person Michael Waterman's "Introduction to Computational Biology" [Chapman & Hall/CRC Statistics and Mathematics; ISBN 0412993910] and Pavel Pevzner's "Computational Molecular Biology - An Algorithmic Approach" [The MIT Press (A Bradford Book); ISBN 0262161974] will give you all the discrete maths you can shake a stick at, but perfunctory introductions to the biology.

Bioinformatics.org's very own Jeff Bizzaro recommends Dan Gusfield's "Algorithms on Strings, Trees and Sequences" [Cambridge, 1997 ISBN 0-52158-519-8], Richard Durbin, S. Eddy, A. Krogh and G. Mitchison's "Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids" [Cambridge, 1997 ISBN 0-52162-971-3] (which I think is one of the clearest and most comprehensive guides to alignment algorithms) and---for that full "computers-to-biology conversion"---Geoffrey M. Cooper's "The Cell: A Molecular Approach" [ASM Press, 1996 ISBN 0-87893-119-8]. Jeff Ames writes that a second edition of this book is now available [Sinauer Associates, Incorporated, 2000 ISBN 0-87893-106-6] and that this version---if you can find it in the shops---comes with a CD.

Applying bioinformatics to biological research

One outstanding new comprehensive text for the biologist is David W. Mount's "Bioinformatics" [Cold Spring Harbor Press; ISBN 0879696087]. It's not cheap, but it's the best I've seen if you are studying bioinformatics itself.

If you're coming to the subject as a computer user with a biological background, looking to exploit the many tools available, you might want to try Terry Attwood and David Parry-Smith's "Introduction to Bioinformatics" [Longman Higher Education; ISBN 0582327881], or Des Higgins and Willie Taylor's "Bioinformatics: Sequence Structure and Databanks" [Oxford University Press; ISBN 0199637903]. Bioinformatics.org also recommends Cynthia Gibas and Per Jambeck's "Developing Bioinformatics Skills" [O'Reilly, 2001 ISBN 1-56592-664-1].

Stuart Brown recommends his own book "Bioinformatics: A Biologist's Guide to Biocomputing and the Internet" [Eaton Pub Co; ISBN: 188129918X]. If he sends me a review copy I might recommend it too ;-) .

Further suggestions for this section are welcome.

Other lists of bioinformatics books

See also compbiology.org's list and Steve Brenner's list.

What bioinformatics sites are there?

Directories

Christy Hightower, Engineering Librarian at the Science and Engineering Library, University of California Santa Cruz has already done this better than me. Visit her excellent article about bioinformatics Net resources in Issues in Science and Technology Librarianship.

Tutorials

A great place to start, whether you come from a biological, physical or computational background is at Martin Vingron's superb online bioinformatics tutorial. (Begin by choosing a section from the left-hand-side menu bar.)

Tom Smith and Don Emmeluth have produced a nice little exploration of bioinformatics using NCBI resources and tools. (I suspect that they might have a dry sense of humour too. If you visit the root page of this Web tree you will find a page of such comprehensively tasteless geekiness that you will either laugh yourself stupid or be put off bioinformatics for life.)

I recently stumbled upon a promising set of online lecture notes currently under construction by B. Steipe at the Genzentrum (Gene Center) at the Ludwig-Maximilians-Universität München (University of Munich).

Chemistry for all

A defiantly frames-free chemistry tutorial site.

Mathematics for biologists

First of all, an almost completely painless introduction to the horrors of the quadratic equation by Peter Whalen, James Walker, and Drew Marticorena.

C. J. Schwarz of the Department of Statistics and Actuarial Science, Simon Fraser University has produced a course in "Statistics for the Life Sciences" which is accompanied by a set of sound, online HTML handouts. They aren't the prettiest, but they're some of the best. (Though his "paradigm of statistics" mnemonic "TRRGET" is completely inconsistent with his explanation of what the letters stand for... If anyone can enlighten me I'd be pleased to know what I'm failing to understand.)

Here is a great guide to a whole array of statistical learning/teaching resources prepared by Juha Puranen of the University of Helsinki (English).

Computers for biologists

Programming for biologists

General introduction to biology for computer scientists

Estrella Mountain Community College in the States offers this excellent short introduction to biology (actually "The Nature of Science and Biology"). It's a great place for keyboard jockeys to start their journey to enlightenment.

Molecular biology for computer scientists

The Institute of Arable Crop Research Beginner's Guide to Molecular Biology

Protein chemistry for computer scientists

Unilever Education Advanced Series tutorial on proteins.

Cell biology for computer scientists

The University of Arizona has made available a high-quality tutorial in cell biology. Not only does it cover the facts, but it also attempts to introduce some of the philosophy of the field---recommended. Even better, it's also available en Español.

Once you've worked your way through that you might like to see some scanning electron microscope images of some of the structures you've read about taken by members of John Heuser's lab.

Evolution for computer scientists

Bob Patterson maintains his "Darwiniana" with amazing diligence.

Practical bioinformatics

Societies

Humberto Ortiz Zuazaga kindly introduced me to The International Society for Computational Biology which he points out "has links to programs of study and online courses in computational biology and to job postings".

Collections of Tools

I cannot recommend strongly enough the Human Genome Mapping Project Resource Centre's "GenomeWeb".

Of historical interest only now, I guess, is the legendary "Pedro's Molecular Biology Search and Analysis Tools".

Portals

CCP11 (Collaborative Computational Project 11) is another great product of the UK's Genome Campus. To quote their Web site, it was...

"...established to foster the broad bioinformatics community and the UK research community in particular. Its purpose is to facilitate the transfer of knowledge and expertise through conferences, workshops, a newsletter and the use of the world wide web. CCP11 is funded by the BBSRC and is hosted at the MRC Human Genome Mapping Project Resource Centre HGMP-RC located on the Wellcome Trust Genome Campus, Cambridge."

Jennifer Steinbachs runs compbiology.org which is a general computational biology site as well as being a portal to her own work.

BioPlanet is well worth visiting, though I have to say I have no idea who runs it or what its precise status (commercial, personal, for-fun) as a Web site is.

Education

Where can I study Bioinformatics?

This section is not complete, but contributions to broaden its coverage are welcome. Please do not direct questions about eligibility, course quality or admissions policy to me; ask the individual institutions directly. Use the links to obtain contact details. If an institution doesn't provide telephone numbers/email addresses or snailmail details on its Web site it doesn't deserve your patronage.

This resource focuses on complete, full-time degree programmes rather than on individual study modules. Curating a list of the latter would be a full-time job. You can go to other places, however, if you are looking for short courses. Thanks to various contributors, including Wentian Li, who pointed me to this list at Rockefeller which is mirrored at various other sites, and Humberto Ortiz Zuazaga, who mailed me a link to the ISCB, where you can find this list. In the UK The Bioinformatics Resource (part of the BBSRC's CCP11 project) maintains (among many other resources) lists of (mainly) British Masters and PhDs in bioinformatics. If you have any suggestions or updates please contact me with them. You can publicize your course and offer a public service at the same time.

Africa

South African National Bioinformatics Institute (SANBI) Honours Bioinformatics Course at the University of the Western Cape. Next year the same institute will be offering a Master's in bioinformatics---thanks to Cathal Seoighe.

If you know of any other bioinformatics courses on the African continent please feel free to mail me about them.

The Americas

Canada

The University of Waterloo, Department of Computer Science offers undergraduate and graduate courses in bioinformatics. More information is here.

California

In apparent contradiction to the URL, the Keck Graduate Institute claims that computational biology is a core element of the curriculum in its Master of Bioscience degree.

Stanford University M.S./PhD. in BioMedical Informatics

Thanks to Momchil Georgiev for the information that the University of California at San Diego offers a Bioinformatics graduate programme and to Dana Brehm for the news that there is now a new bachelor's program. To quote her:

"[This is an] undergraduate, interdisciplinary program for undergraduates leading to a B.S. degree. The new Bioinformatics major is offered by the Division of Biology, and the departments of Chemistry/Biochemistry, Computer Science and Engineering, and Bioengineering. A student may choose to major in Bioinformatics in any one of the four departments or division. The Division of Biology currently offers two Bioinformatics courses, and with the advent of the cross-disciplinary major, even more courses are going to be taught 2002-03 and 2003-04."

University of California, Irvine Informatics in Biology and Medicine

David Delong wrote to me to point out that the College of Natural and Agricultural Sciences at the University of California, Riverside is developing a "Center in Genomics and Bioinformatics" which will offer a PhD curriculum in genomics and bioinformatics from academic year 2001-2002 onwards.

Catherine Velazquez says that the University of California, Santa Cruz will start a new undergraduate BS course in bioinformatics in the fall of 2001. They also have made public their proposal for an MS in Bioinformatics.

Maine

The Jackson Lab, a world centre of mouse genome informatics, offers a graduate training program.

Massachusetts

Boston University and Northeastern University offer a graduate programme in bioinformatics.

Mexico

At the National Autonomous University of Mexico a doctoral program in biomedical sciences is available. Their Computational Molecular Biology Group is here.

Minnesota

The University of Minnesota offers a graduate programme in bioinformatics.

New York State

Rochester Institute of Technology Bachelor's and Masters of Science in Bioinformatics

If you know of any other bioinformatics courses on the American continent please feel free to mail me about them.

North Carolina

The North Carolina State University Genomic Sciences program offers Masters and PhDs in Bioinformatics.

Virginia

The Virginia State University's Bioinformatics Institute offers graduate options in Bioinformatics.

Asia

India

Vaibhav Sinha wrote to tell me that the Institute of Bioinformatics and Applied Biotechnology (IBAB) in Bangalore is offering bioinformatics courses.

According to Rahul Agrawal, the Indian Institute of Technology Delhi, New Delhi provides courses in Biochemical Engineering and Biotechnology. He adds that another branch of the Institute, IIT Kharagpur, also provides various courses in this area.

There is an Advanced (Graduate) Diploma in Bioinformatics in the Bioinformatics Centre at the Jawaharlal Nehru University.

Madurai Kamaraj University in Madurai, India claims to have been the first in the country to initiate a bioinformatics programme and advanced diploma in bioinformatics at its School of Biotechnology.

The University of Pune, Maharashtra, India offers an Advanced Diploma in Bioinformatics at its Bioinformatics Centre.

Singapore

The Bioinformatics Centre of the National University of Singapore offers Undergraduate and PhD programmes in conjunction with the life sciences departments and research institutions at NUS.

Lam Ah Wah wrote to tell me that the Nanyang Technological University (NTU) starts a BioInformatics undergraduate and part-time post-graduate MSc course in July 2002. Be warned: their Web site has a hideous frame/window-based "portal" which breaks half a dozen rules of good interface design. I couldn't find pages about the actual courses---perhaps you can?

If you know of any other bioinformatics courses in Asia please feel free to mail me about them.

Australasia

Australia

The Research School of Biological Sciences, at the Australian National University in Canberra offers PhD., MSc. and Honours programs in Bioinformatics.

You can obtain a Graduate Certificate in Bioinformatics from Curtin University of Technology in Western Australia.

As of 2001 Flinders University in Adelaide offers a Bachelor's of Science in Bioinformatics.

The Biochemistry Department of La Trobe University in Victoria also offers an undergraduate course in Bioinformatics.

The University of New South Wales in Sydney offers an undergraduate program in Bioinformatics.

Sydney University in New South Wales offers a Bachelor's of Science in Bioinformatics.

If you know of any other bioinformatics courses in Australasia please feel free to mail me about them.

Europe

Belgium

A consortium including nearly all the French-speaking universities of Belgium (Bruxelles, Liège, Louvain, Mons, Namur and Gembloux) is offering the "Inter-University DEA/DES (Master) in Bioinformatics".

The Department of Engineering at the Katholieke Universiteit Leuven offers a Master of Bioinformatics degree.

Denmark

The Technical University of Denmark, Center for Biological Sequence Analysis offers MSc.-level and PhD.-level courses in bioinformatics.

Finland

The Finnish Graduate School in Computational Biology, Bioinformatics, and Biometry or "ComBi" is a joint venture of the University of Helsinki, the University of Turku and the University of Tampere.

Eire (Ireland)

According to James O. McInerney there will be details of the National University of Ireland's undergraduate degree course in bioinformatics and computational biology here. He also organizes a bioinformatics summer school.

Germany

The Technische Fakultät (Faculty of Technology) at Universität Bielefeld (Bielefeld University), offers a graduate programme in Bioinformatik (bioinformatics).

The Universität Tübingen (University of Tübingen) also offers Bioinformatik. Here are their own Frequently Asked Questions (in German only) about studying bioinformatics there.

The Netherlands (Holland)

The Centre for Molecular and Biomolecular Informatics (CMBI) at the University of Nijmegen offers a Master's degree in bioinformatics. This is a one or two year course leading to a degree with the formal title of "Master in Life Sciences", but the subtitle "Bioinformatics".

Sweden

Bjorn Olsson writes that, as well as a 4-year Master's Degree in Bioinformatics, the University of Skövde offers a number of short courses and allows computer science master's students to include bioinformatics in their degree. There is more information here.

Apart from this, adds Daniel Nilsson, there is only one other "pure" bioinformatics course in Sweden: the MSc in Bioinformatics Engineering in Uppsala. There are also opportunities to study bioinformatics on the "normal" biotech courses in Gothenburg, Linköping and Umeå. In the first of these, the School of Mathematical and Computing Sciences at Chalmers offers an MSc. programme in bioinformatics---thanks to Samuel Hargestam.

United Kingdom

In the UK, there are only two dedicated undergraduate courses in bioinformatics---one at the University of Birmingham and another at UMIST. A major problem is the desperate skills shortage in the area. Experts in the field can earn considerably more in high-status commercial or government research jobs than in universities---without having to dedicate time to teaching. Bioinformatics is the ideal postgraduate scientific subject, best suited to those who are already trained in one of its constituent disciplines.

Two pioneering university institutions are Birkbeck College in the University of London and York University. Birkbeck is a British centre with a proud tradition in educating working and/or mature students to the highest academic standards, and it has a superb X-ray crystallography group. York's Department of Biology offers Masters courses and PhDs in both computational biology and biomolecular science. Other universities have bioinformatics groups actively involved in the teaching of their biology/molecular biology undergraduate courses, including, for example, Leeds University, where there are also MRes studentships available. Manchester University also teaches bioinformatics to its undergraduates as well as offering a taught MSc. course in the subject. University College London (UCL) also offers a final year undergraduate course: "Bioinformatics: Genes, Proteins and Computers".

Imperial College recently displaced Oxford (at least temporarily) from second place in various "charts" of the "best" universities in the UK. [Disclaimer: I was a graduate student at Imperial and teach on two graduate courses there.] From next year the Department of Biochemistry at Imperial is offering a new MSc in Computational Genetics and Bioinformatics. (Oxford itself hasn't yet deigned to recognize the field with a degree course. [Disclaimer: I was an undergraduate there.])

Thank you to David Parkinson for pointing out to me that for the past two years Sheffield Hallam University has offered an MSc/PGDip in Bioinformatics at its Graduate School in Science, Engineering and Technology.

Other UK Bioinformatics courses include:

the various graduate programmes offered by the University of Exeter MSc/MRes in Bioinformatics.

University of Glasgow MRes in Bioinformatics.

University of Liverpool M.Sc., Postgraduate Diploma and Postgraduate Certificate in Biosystems & Informatics

University of Nottingham Master of Philosophy in Molecular Biology with Bioinformatics

In April 2002 City University's Bioinformatics group is moving---along with its PhDs---to the University of Glasgow Department of Computer Science. Thanks to Will Bachelor for alerting me to the existence of this group.

If you know of any other bioinformatics courses in Europe please feel free to mail me about them.

Careers

How can I get involved?

If you want to get involved in bioinformatics, now is an exciting time. I can honestly say this is one area of science where demand for skilled practitioners (and salaries) can be very high.

This section is opinionated, partly because there are people in the field, both computer scientists and biologists, whom I would love to provoke (or convert). If you are a newcomer, and especially if you come from one of bioinformatics' component pure disciplines, I hope my ranted warnings will help you to avoid the mistakes of your predecessors---and I write as one of the mistaken. David S. Roos put it well in his recent review in the journal Science:

"Lack of familiarity with the intellectual questions that motivate each side can also lead to misunderstandings. For example, writing a computer program that assembles overlapping expressed sequence tags (EST) sequences may be of great importance to the biologist without breaking any new ground in computer science. Similarly, proving that it is impossible to determine a globally optimal phylogenetic tree under certain conditions may constitute a significant finding in computer science, while being of little practical use to the biologist."

How can I get involved?---I am a "newbie"

Please read the education section above for information about some of the places you can currently study bioinformatics. Please do not direct questions about eligibility, course quality or admissions policy to me; ask the individual institutions directly.

If you are a high school student / sixth former, think about taking an interdisciplinary computational biology or bioinformatics bachelor's degree of the sort offered at, for example, Manchester University in the UK or UPenn in the States. Don't worry if you can't find a place on such a course or there isn't one nearby; perhaps the best way to approach this subject is from two sides. Do a bachelor's degree in one area while taking a healthy interest in the other---or (if you can afford to) complement a first degree in one part of the discipline with a second degree in the second.

If you already have a degree in a biological discipline there are similar Master's courses---both interdisciplinary (e.g. Birkbeck's in London) and conversion type courses---for biologists or others to learn computer science, for example.

If you are currently doing a computer science or biology PhD, try to take advantage of the opportunity to take courses in the "other" discipline.

How can I get involved?---I am a biologist

To a biologist I would say: take as many real computing courses as you can. It's important not just to learn a programming language, but also to learn the discipline of computing; to structure and document your work in a rigorous way. What courses you take might be directed by the kind of work you are interested in doing when you graduate---whether you see yourself supporting bioinformatics applications or building them. For the former you need all-round familiarity with the programs themselves and the hardware and software needed to run them---plus your existing understanding of biology. For the latter you need to learn a structured programming language and the principles of good program design---plus the ability to talk to and understand biologists.

Courses biologists might consider taking:

UNIX
Mathematics
Programming

How can I get involved?---I am a computational/quantitative scientist

One thing that I will emphasise repeatedly in this section is the simple value of doing some "proper" biological laboratory science. I have sat through talk after talk where a bioinformatics "scientist" describes in great detail how his (it's usually "his") whizzy new application of a trendy mathematical tool offers a supposed insight into a (sometimes supposed) biological problem. Nine times out of ten I know that this "solution" will never be so much as sneezed on by a practising biologist.

Quantitative scientists talk about their interest in studying some aspect of "God's mind". Biologists are interested in "Mother Nature's body". If you want to win Nature over you are going to have to meet her in the flesh. You are as likely to be useful to biologists working in isolation at the keyboard as you are to conceive with your clothes on. Desk-bound bioinformaticists have written code that has turned out to be popular with biologists, but almost always because they have collaborated with biologists.

Courses quantitative scientists might consider taking:

Molecular biology
Protein (bio)chemistry
Evolutionary biology

You might also like to peruse Cynthia Gibas's answers to similar questions from computational scientists on the O'Reilly Web site.

These damned biologists are making me use Word instead of LaTeX to write up---what can I do?

Try this.

More general advice

Use the software

Get access to an installation of EMBOSS and/or Staden and get someone to lead you through the tools available. RasMol is a simple, but powerful and elegant molecular imaging program which can teach you a great deal about biological macromolecules; try a tutorial. Get out on the Web and do some productive surfing for a change :-) . The best starting point is the Human Genome Mapping Project Resource Centre's "GenomeWeb". There's so much stuff out there -- and most of it is free to academics.

Where can I find Bioinformatics jobs?

Start with the appointments / careers sections of the major scientific journals, or, better, search their Web jobs pages with "bioinformatics":

Appropriately for a Web-dependent discipline, there are a variety of specialist commercial Web sites which carry bioinformatics jobs:

There are also a number of companies actively recruiting in the area:

Genome Therapeutics Corporation

Practical tips

This section includes some simple rules-of-thumb to apply when performing common bioinformatics tasks. I try to give a reference to a more detailed source of guidance where I know of one.

How do I find a sequence?

The most common task in bioinformatics must be the acquisition of some bioinformatics data on which to operate. Usually this is in the form of a nucleic acid or protein sequence, stored as characters in the appropriate alphabet together with a header of related information: for example, some kind of unique identifying number, the species from which the original biological substrate was obtained, the names of any authors who published the sequence and so on.
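To make the "characters plus header" idea concrete, here is a minimal Python sketch that reads records in FASTA format. FASTA is only one common text format and this FAQ does not prescribe it; that choice is my assumption. The two records, including their identifiers and descriptions, are invented purely for illustration.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Invented example records (not real database entries).
EXAMPLE = """>demo001 hypothetical protein [Imaginary species]
AKVAIL
VCGMD
>demo002 another made-up record
GSGK
"""

records = list(parse_fasta(EXAMPLE))
for header, sequence in records:
    print(header, len(sequence))
```

Real database entries carry far richer headers than this, so treat the sketch as a way of seeing the structure rather than as a replacement for an established parsing library.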

You may have already generated your own sequence data experimentally. In this case you are likely to want to find sequences which are identical or similar (and therefore possibly related) to yours. The task is then one of similarity search.

...I have a description.

A paradoxical problem generated by the success of the bioinformatics revolution is the increasing difficulty of navigating the huge amount of data available. Once you could print out most of the existing sequence databases onto paper and cram them into a single binder. Now a search for "actin" alone will pull out hundreds and hundreds of sequences. The key to finding what you want is to develop your own discriminatory skills rather than rely on computers to figure out what it is you're really after.

Use PubMed

Make sure you are clear about your aim first. If you are looking for a sequence for a specific scientific purpose then you might be best to start with a relevant human-generated publication. For example, if you have cloned a gene which is part of a well-characterised biochemical pathway and you want to find other sequences of the same functional gene product in other species (orthologues), PubMed is your friend. [XXXX CONTINUE DETAILED ADVICE HERE]

Use Swiss Prot

[XXXX INSERT DETAILED ADVICE HERE]

Use Boolean logic

[XXXX INSERT DETAILED ADVICE HERE]

Use cunning

[XXXX INSERT DETAILED ADVICE HERE]

...I have an accession number.

[XXXX INSERT DETAILED SEQUENCE ADVICE HERE]

...I have another sequence.

This section will be expanded---and there will be a more basic and detailed explanation for novice searchers, but, in the meantime, here are the top tips cribbed from the excellent paper by Hugh B. Nicholas Jr., David W Deerfield II and Alexander J. Ropelewski in BioTechniques.

Use a local favourite program on the Web server of your choice.
Use at least two and preferably three similarity tables.
If using Smith-Waterman or FASTA algorithms ensure that the gap opening penalty is high enough.
If the initial search finds no or insufficient matches repeat it with a highly diverged matrix and/or with a Smith-Waterman-based server.
If this doesn't work try switching from a PAM matrix to a BLOSUM matrix.

...I'm not sure whether or not to use the defaults.

Hugh, David and Alexander again on when not to use the default search parameters provided by a server.

...when the homologues you are looking for to match your query are highly diverged.
...when the query or matches are short.
...when you are only interested in a specific (in the sense of "species") subset of database matches with a particular evolutionary relationship to your sequence of interest---a relationship not implied by the default settings.

How can I align two sequences?

This section will also be expanded for newbies; until then, here are Hugh, David and Alexander's tips for alignment:

Use an appropriately divergent matrix (I'll be adding a table soon to explain this).
Reduce your gap penalty relative to that you used for your database search.
Use the MaxSegs/Waterman-Eggert version of the dynamic programming algorithm to provide the best local alignment and also to search for repeats.
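For readers who have never seen the dynamic programming algorithm behind these tips, here is a minimal Python sketch of plain Smith-Waterman local alignment. It is not the MaxSegs/Waterman-Eggert variant recommended above (which goes on to report further non-overlapping alignments); it finds a single best local alignment only. The scoring scheme (match +2, mismatch -1, a flat gap penalty of -2) is an arbitrary assumption of mine standing in for a real similarity matrix and gap model.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; no cell may drop below zero (that makes it "local").
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


result = smith_waterman("AKVAIL", "AKIAIL")
print(result)
```

Lowering the magnitude of `gap` makes the sketch more willing to open gaps, which is the effect the second tip above is after; with a real similarity table you would replace the `match`/`mismatch` pair with a lookup per residue pair.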

How can I predict the function of a gene (product)?

[XXXX INSERT FUNCTION PREDICTION ADVICE HERE]

How can I predict the structure of a sequence?

[XXXX INSERT STRUCTURE PREDICTION ADVICE HERE]

How can I write up?

Go here to download some detailed advice. Go here for more links.

Glossary of bioinformatics terms

Here I attempt to define some common terms in bioinformatics. I have tried to balance clarity, brevity and rigour. Let me know if I let one of these priorities over-ride the others.

What is an alignment?

When two symbolic representations of DNA or protein sequences are arranged next to one another so that their most similar elements are juxtaposed they are said to be aligned. Many bioinformatics tasks depend upon successful alignments. Alignments are conventionally shown as traces.

In a symbolic sequence each base or residue monomer in each sequence is represented by a letter. The convention is to print the single-letter codes for the constituent monomers in order in a fixed font (from the N-most to C-most end of the protein sequence in question or from 5' to 3' of a nucleic acid molecule). This is based on the assumption that the combined monomers are evenly spaced along the single dimension of the molecule's primary structure. From now on I shall refer to an alignment of two protein sequences.

Every element in a trace is either a match or a gap. Where a residue in one of two aligned sequences is identical to its counterpart in the other the corresponding amino-acid letter codes in the two sequences are vertically aligned in the trace: a match. When a residue in one sequence seems to have been deleted since the assumed divergence of the sequence from its counterpart, its "absence" is labelled by a dash in the derived sequence. When a residue appears to have been inserted to produce a longer sequence a dash appears opposite in the unaugmented sequence. Since these dashes represent "gaps" in one or other sequence, the action of inserting such spacers is known as gapping.

A deletion in one sequence is symmetric with an insertion in the other. When one sequence is gapped relative to another a deletion in sequence a can be seen as an insertion in sequence b. Indeed, the two types of mutation are referred to together as indels. If we imagine that at some point one of the sequences was identical to its primitive homologue, then a trace can represent the three ways divergence could occur (at that point).

Biological interpretation of an alignment

A trace can represent a substitution:

AKVAIL

AKIAIL

A trace can represent a deletion:

VCGMD

VCG-D

A trace can represent an insertion:

GS-K

GSGK

For obvious reasons I do not represent a silent mutation.

Traces may represent recent genetic changes which obscure older changes. Here I have only represented point mutations for simplicity. Actual mutations often insert or delete several residues.
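The three cases above can be read off a trace column by column. The following Python sketch does exactly that for the example traces in this section. It assumes, as the examples do, that the first (upper) sequence is treated as the primitive form and the second (lower) as the derived one; the function name and the label strings are my own inventions.

```python
def classify_columns(primitive, derived):
    """Label each column of a two-sequence trace.

    Both strings must be the same length, with '-' marking a gap.
    """
    if len(primitive) != len(derived):
        raise ValueError("aligned sequences must have equal length")
    labels = []
    for p, d in zip(primitive, derived):
        if p == "-" and d == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if d == "-":
            labels.append("deletion")      # residue lost from the derived sequence
        elif p == "-":
            labels.append("insertion")     # residue gained by the derived sequence
        elif p == d:
            labels.append("match")
        else:
            labels.append("substitution")
    return labels


print(classify_columns("AKVAIL", "AKIAIL"))
print(classify_columns("VCGMD", "VCG-D"))
print(classify_columns("GS-K", "GSGK"))
```

Swapping the two arguments turns every deletion into an insertion and vice versa, which is the symmetry that the term "indel" acknowledges.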

What is a DNA array?

[Thanks to Bioinformatics.Org member Ravi Jain for the following answer, which I present verbatim.] DNA microarrays consist of thousands of immobilized DNA sequences present on a miniaturized surface the size of a business card or less. Arrays are used to analyze a sample for the presence of gene variations or mutations (genotyping), or for patterns of gene expression, performing the equivalent of ca. 5 000 to 10 000 individual "test tube" experiments in approximately two days of time.

Robotic technology is employed in the preparation of most arrays. The DNA sequences are bound to a surface such as a nylon membrane or glass slide at precisely defined locations on a grid. Using an alternate method, some arrays are produced using laser lithographic processes and are referred to as biochips or gene chips. The composition of DNA on the arrays is of two general types:

Oligonucleotides or DNA fragments (approximately 20-25 nucleotide bases). These arrays are frequently used in genotyping experiments. The sequences of alternate gene forms may be included for detection of mutations or normal variants (polymorphisms).
Complete or partial cDNA (approximately 500-5 000 nucleotide bases). These arrays are generally used for relative gene expression analysis of two or more samples; however, oligonucleotide-based arrays may also be used for these studies.

DNA samples are prepared from the cells or tissues of interest. For genotyping analysis, the sample is genomic DNA. For expression analysis, the sample is cDNA, DNA copies of RNA. The DNA samples are tagged with a radioactive or fluorescent label and applied to the array. Single stranded DNA will bind to a complementary strand of DNA. At positions on the array where the immobilized DNA recognizes a complementary DNA in the sample, binding or hybridization occurs. The labeled sample DNA marks the exact positions on the array where binding occurs, allowing automatic detection. The output consists of a list of hybridization events, indicating the presence or the relative abundance of specific DNA sequences that are present in the sample.
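[Editorial illustration, not part of Ravi Jain's answer.] The statement that single stranded DNA binds to a complementary strand can be sketched in a few lines of Python. This is a deliberately crude toy model under my own assumptions: it treats hybridization as an exact match between a sample strand and the reverse complement of a probe, whereas real hybridization tolerates some mismatches and depends on experimental conditions. The probe names and sequences are invented.

```python
# Translation table for Watson-Crick base pairing.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (5' to 3')."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def hybridizing_probes(probes, sample):
    """Return names of probes whose exact reverse complement occurs in sample."""
    sample = sample.upper()
    return [
        name
        for name, probe in probes.items()
        if reverse_complement(probe) in sample
    ]


# Invented probes and sample strand, for illustration only.
probes = {"probe_A": "ATGC", "probe_B": "GGGG"}
sample = "TTGCATAA"

hits = hybridizing_probes(probes, sample)
print(hits)
```

Each name in the returned list corresponds to one "hybridization event" in the sense of the paragraph above; a real array additionally reports signal intensity at each grid position.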

What is a homologue?

[INSERT FULL DEFINITION HERE.]

What is a scoring matrix?

[INSERT FULL DEFINITION HERE.]

Acknowledgements

Questions

Thanks to the following people for questions:

Jonathan Després
Salma B. Rafi
"Ritu"
Michael Wentzel

Answers

Thanks to the following people for suggesting answers:

Paul Boardman
Ravi Jain
Sangeeta Sawant
Fredj Tekaia

Small Print:

Author and licensing

This resource is maintained by and © Damian Counsell, UK Medical Research Council Human Genome Mapping Project Resource Centre (the HGMP-RC) 1998-2002. It is made available under a modified version of the Open Publication Licence. It is currently mirrored at The Bioinformatics Resource and at eBioinfogen.

This resource has also been mirrored, without credit or any attempt to link to the Open Content Licence, at the so-called "National Bioinformatics Institute". If you are thinking of handing over money for their "certification" you can draw your own conclusions about their standing from this fact.

The first version of this resource was prepared when I was responsible for bioinformatics in the Section for Cell and Molecular Biology at the Institute of Cancer Research (the ICR) in London.

I am now a bioinformatics specialist at the HGMP-RC, part of the Proteomics Group and am supported by the Medical Research Council. This page does not represent their views, but I will happily read your criticisms. Although I may act on your advice I take no responsibility for anything that might happen if you browse here.

Version control information

$Revision: 1.92 $ $Date: 2002/04/14 20:58:55 $ $Author: counsell $
