Hyunji's Bioinformatics Tutorial

Lecture on 'Sequence analysis' : 9:30am - 10:30am
Coffee break : 10:30am - 10:45am

I. Searching for sequences by homology : 10:45am - 11:00am + 5'

1. BLAST: Basic Local Alignment Search Tool

a) NCBI-BLAST [http://www.ncbi.nlm.nih.gov/BLAST/]
We will mainly work with protein sequences and protein databases in this practical. Therefore, we try out 'blastp' first and then PSI-BLAST (Position-Specific Iterated BLAST).

The two BLAST programmes are very widely used and run in a very similar fashion. There is, however, a clear difference in how they actually execute internally, which is often hidden from web-based users. While we try them out together, I will explain the difference.

b) WU-BLAST [http://www.ebi.ac.uk/blast2/]
This version of BLAST was developed and is maintained by Washington University, USA.

NCBI-BLAST and WU-BLAST are quite similar in terms of the main algorithms involved, but they do display some minor differences, such as default settings (e.g. whether the filtering option is on or off), the choice of substitution matrices you get (when you use the web-based versions), and how you construct basic command lines (should you prefer running them on a linux box).

2. FASTA

Fasta-Protein Similarity Search [http://www.ebi.ac.uk/fasta33/]
Another heuristic approach, like BLAST. Just letting you know that there is something other than BLAST, for your easy access.

3. MPsrch

MPsrch - Protein Database Query [http://www.ebi.ac.uk/MPsrch]
It may not have been as popular as the above two, but it has certainly been around for some time.

Rather than relying on heuristic approaches for the sake of speed, MPsrch sticks to the original local alignment algorithm, Smith-Waterman, for maximum sensitivity. Since the dynamic programming involved is computationally heavy, it makes up for the loss of speed by fully exploiting parallel computing, as far as I have learnt.
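To make the Smith-Waterman idea concrete, here is a minimal sketch in Python. It is not MPsrch's implementation: the simple match/mismatch scores and the linear gap penalty are assumptions chosen for brevity, whereas real tools use a substitution matrix and usually affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    The scoring values are illustrative assumptions, not MPsrch defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic-programming table; a cell never drops below zero,
    # which is what makes the alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTVGYGD", "AATVGYGKK"))
```

Every cell of the table has to be filled, so the cost grows with the product of the two sequence lengths. That is the 'speed' problem which heuristics such as BLAST and FASTA sidestep and which parallel hardware attacks head-on.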

*************************************************************************************
We use a potassium channel sequence, KcsA, as our input for this session.

>KcsA
mppmlsgllarlvklllgrhgsalhwraagaatvllvivllagsylavlaergapgaqlitypralwwsvet
attvgygdlypvtlwgrlvavvvmvagitsfglvtaalatwfvgreqerrghfvrhsekaaeeaytrt
tralherfdrlermlddnrr

*************************************************************************************

You can copy/paste it into the input box provided by each of the above three programmes. Try the default settings first, and we will discuss when to choose which substitution matrix and how different outputs can result from it.

This part may not run as straightforwardly as it should, since we are restricted to web-only sessions, but we can still try to get a taste of it.
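If you ever want to tidy a FASTA sequence before pasting it, a small sketch like the following may help. The function name `parse_fasta` is made up for this example; it simply removes blank lines and stray whitespace that copy/paste can introduce, and converts residues to upper case.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            # Drop any whitespace inside the line and normalise the case.
            records[name].append("".join(line.split()).upper())
    return {n: "".join(parts) for n, parts in records.items()}


example = """>KcsA
mppmlsgll arlvkl
llgrhgsalh
"""
print(parse_fasta(example))
```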

II. Multiple sequence alignment : 11:05am - 11:20am + 5'

1. ClustalW

CLUSTALW [http://www.ebi.ac.uk/clustalw/]
It constructs every possible pair-wise sequence alignment from your input sequences. Based upon the similarities observed, it builds a simple tree first and then assembles those pair-wise alignments gradually, following the tree, in order to return a final multiple sequence alignment to you. If your sequences are close to one another, it might be okay to rely on semi-automatic alignment tools like this. It can, however, be slightly more challenging, should you bring a bunch of quite distantly related homologues. We will deal with how such a situation can be handled later on, in session III.
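The first stage (all pairs, then 'who is closest to whom') can be sketched as below. Note the swap: instead of real pair-wise alignments, this example uses a crude k-mer (short word) overlap as a stand-in for similarity, purely to keep it short. The function names are invented for the illustration.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=3):
    """1 minus the Jaccard similarity of the two k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 1.0
    return 1.0 - len(ka & kb) / len(ka | kb)


def closest_pairs(seqs, k=3):
    """All n(n-1)/2 pairs as (distance, name1, name2), most similar first."""
    pairs = [(kmer_distance(seqs[x], seqs[y], k), x, y)
             for x, y in combinations(sorted(seqs), 2)]
    return sorted(pairs)


toy = {"a": "TTVGYGDL", "b": "TTVGYGDM", "c": "KKKKKKKK"}
for dist, x, y in closest_pairs(toy):
    print(x, y, round(dist, 3))
```

With n sequences there are n(n-1)/2 pairs, so the eleven sequences used in this session already give 55 pair-wise comparisons. The sorted list is what a guide tree is built from: the closest pair is aligned first, and more distant sequences are added later.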

2. T-Coffee

T-COFFEE Tree-based Consistency Objective Function For alignment Evaluation [http://www.ebi.ac.uk/t-coffee/]
Another one of those progressive methods for building a multiple sequence alignment. It, however, tries to overcome a weakness seen in ClustalW, by introducing an intermediate stage between the complete set of all possible pair-wise alignments and the final multiple sequence alignment. It is somewhat like introducing restraints of your own to enhance (or steer a little) homology models of your protein.

If clustalW can be said to rely upon a guide tree to link individual pair-wise alignments to a final multiple alignment, t-coffee tries to build a bit more sophistication right into the middle of the process. 3DCoffee is meant to let you combine information from PDB, to improve your alignment. Some people do find T-coffee very useful -- it is just that you don't get to see how/why there should be much difference at all between clustalw and t-coffee, if you only use web-based versions. All you get to do is paste your sequences in FASTA format, and that is almost it.
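That 'intermediate stage' is a library of residue pairs whose weights are reinforced when a third sequence agrees with them. The sketch below is my simplified reading of the library extension described in the T-Coffee paper, not the programme's actual code; the data layout (residues as (sequence, position) tuples) is an assumption made for the example.

```python
from collections import defaultdict


def extend_library(primary):
    """Triplet extension of a primary library.

    primary maps frozenset({res_x, res_y}) -> weight, where each residue is
    a (sequence_name, position) tuple taken from a pair-wise alignment.
    """
    # For every residue, remember which residues it is paired with.
    links = defaultdict(dict)
    for pair, weight in primary.items():
        x, y = tuple(pair)
        links[x][y] = weight
        links[y][x] = weight

    extended = dict(primary)
    # If z is paired with both x and y, then x and y gain support through z.
    for z, partners in links.items():
        items = list(partners.items())
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                (x, wx), (y, wy) = items[i], items[j]
                if x[0] == y[0]:
                    continue  # two residues of the same sequence cannot pair
                key = frozenset((x, y))
                extended[key] = extended.get(key, 0) + min(wx, wy)
    return extended


a1, b1, c1 = ("A", 1), ("B", 1), ("C", 1)
primary = {frozenset((a1, b1)): 88,
           frozenset((a1, c1)): 77,
           frozenset((c1, b1)): 100}
print(extend_library(primary)[frozenset((a1, b1))])
```

The progressive alignment then scores residue pairs with these extended weights instead of a plain substitution matrix, so each pair-wise step already 'knows' something about the other sequences.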

If you are really keen to understand when to use which one based upon what exactly, it would be a good idea to download a linux/unix version to your own workstation and to give it a little bit of tossing/turning around. Reading papers - written by the very developers - also helps. We will, however, stick to the web-version, this morning.

*3DCoffee: Combining Protein Sequences and Structures within Multiple Sequence Alignments. O. O'Sullivan, K. Suhre, C. Abergel, D.G. Higgins, C. Notredame. Journal of Molecular Biology, Vol 340, pp 385-395, 2004

*T-Coffee: A novel method for multiple sequence alignments. C. Notredame, D. Higgins, J. Heringa. Journal of Molecular Biology, Vol 302, pp 205-217, 2000

3. Muscle

Muscle MUltiple Sequence Comparison by Log-Expectation, [http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py/]
Muscle is said to demonstrate improved speed and accuracy in comparison to clustalw and t-coffee. Once you open the box, what you get to see is still a progressive approach to multiple alignment. Shall we say it is the iterated progressive alignment that makes it out-perform clustalw? You may well disagree, but to me the relationship seems somewhere near 'blastp vs psi-blast'. I respect the author/developer, in particular for how he still appreciated and acknowledged clustalw in his original papers, whilst he was clearly presenting a new method supposedly doing far better.

Each method has its own merits, and it is up to you to decide how much you would like to learn about them, I mean really learning about them. Sometimes, an old and humble sword can shine in the hands of the true master (off the track, ok..) I still can not forget how Toby Gibson looked, sitting in his tiny office far down at the corridor, at the EMBL, when he was merely staring at a screenful of sequences presented in clustalw/x style. He was right there, lost in his own world, communicating with them. Thereafter, I just can not dismiss clustalw, that old way-too familiar gadget.
The point is that I am bringing my own subjective opinion undoubtedly, so take what you want to take from this session, and make up your own mind.

*Edgar RC. (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113.

*Edgar RC. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5):1792-7.

4. Editing your alignment

I am not intending to do anything about this, with you this morning. Just thought that you might find it useful later, when you feel like playing around with your almost-there sequence-alignment for presentation-purposes.

a) Jalview - a java multiple alignment editor [http://www.jalview.org]
b) Cinema - Colour INteractive Editor [http://www.biochem.ucl.ac.uk/bsm/dbbrowser/CINEMA2.02/index2.html]

*Parry-Smith, D.J., Payne, A.WR, Michie, A.D. and Attwood, T. K. (1997) CINEMA - A novel Colour INteractive Editor for Multiple Alignments. Gene, 211(2), GC45-56.

**Your input for alignment**

>KVAP_Kv
VMLTVLYGAFAIYIVEYPDPNSSIKSVFDALWWAVVTATTVGYGDVVPATPIGKVIGIAVMLTGISALTLLIGTVSNMF
>KcsA
MPPMLSGLLARLVKLLLGRHGSALHWRAAGAATVLLVIVLLAGSYLAVLAERGAPGAQLITYPRALWWSVETATTVGYGDLYPVTLWGRLVAVVVMVAGITSFGLVTAALATWFVGREQERRGHFVRHSEKAAEEAYTRTTRALHERFDRLERMLDDNRR
>Kv1.2Rat
ASMRELGLLIFFLFIGVILFSSAVYFAEADERDSQFPSIPDAFWWAVVSMTTVGYGDMVPTTIGGKIVGSLCAIAGVLTIALPVPVIVSN
>Kv1.2Human
ASMRELGLLIFFLFIGVILFSSAVYFAEADERESQFPSIPDAFWWAVVSMTTVGYGDMVPTTIGGKIVGSLCAIAGVLTIALPVPVIVSN
>Kir6.2Rat
MVWWLIAFAHGDLAPGEGTNVPCVTSIHSFSSAFLFSIEVQVTIGFGGRMVTEECPLAILILIVQNIVGLMINAIMLGCI
>Kir2.2Human
RYMLLIFSLAFLASWLLFGIIFWVIAVAHGDLEPAEGRGRTPCVMQVHGFMAAFLFSIETQTTIGYGLRCVTEECPVAVFMVVAQSIVGCIIDSFMIGAI
>H-ERG
RYSEYGAAVLFLLMCTFALIAHWLACIWYAIGNMEQPHMDSRIGWLHNLGDQIGKPYNSSGLGGPSIKDKYVTALYFTFSSLTSVGFGNVSPNTNSEKIFSICVMLIGSLMYASIFGNVS
>KCa1.1Human
VNLLSIFISTWLTAAGFIHLVENSGDPWENFQNNQALTYWECVYLLMVTMSTVGYGDVYAKTTLGRLFMVFFILGGLAMFASYVPEIIEL
>SK1Human
VMKTLMTICPGTVLLVFSISSWIIAAWTVRVCERYHDKQEVTSNFLGAMWLISITFLSIGYGDMVPHTYCGKGVCLLTGIMGAGCTALVVAVVARKLELT
>CNG1Human
LVMYIVIIIHWNACVFYSISKAIGFGNDTWVYPDINDPEFGRLARKYVYSLYWSTLTLTTIGETPPPVRDSEYVFVVVDFLIGVLIFATIVGNI
>AKT1_ARATH
CAKLVCVTLFAVHCAACFYYLIAARNSNPAKTWIGANVANFLEESLWMRYVTSMYWSITTLTTVGYGDLHPVNTKEMIFDIFYMLFNLGLTAYLIGNMTN

III. Profile alignment and pattern recognition : 11:25am - 11:40am + 5'

1. HMMer

HMMer 1 for general info [http://hmmer.janelia.org/] [http://hmmer.wustl.edu/]-old

HMMer 2 for one-off calculation [http://bioweb.pasteur.fr/seqanal/motif/hmmer-uk.html]

HMMer (Hidden Markov Modeller) reads as in 'hammer', according to the original web site. You don't have to get familiarised with this, if you are only sitting here simply because you are funded by the Wellcome Trust. If you are, however, going to do any proper bioinformatics-related work, one way or another, you will be touching some sort of tool which essentially has a profile HMM lurking right behind the main home-page or the front door (whichever you prefer).

*It is used when you look for homologues with greater sensitivity, e.g. hmmsearch or psi-blast (I know, psi-blast is not quite an HMM, but it is similar in that it works with a position-specific scoring matrix, PSSM, which plays the part that emission probabilities play in a profile HMM.)
*It is used for improved profile-based alignment, e.g. hmmalign, which can compete with clustalw, t-coffee, and muscle, and will probably knock them off. However, nothing is perfect (sorry..), and this can cause problems on occasion.
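To show what 'position specific' means in practice, here is a minimal log-odds PSSM built from a gap-free block of aligned segments. It is a sketch under stated assumptions (a uniform background frequency of 1/20 and one pseudocount per column), not what psi-blast or HMMer actually compute.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """One dict of log2-odds scores per alignment column.

    Assumes equal-length aligned strings and a uniform background.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment
                         if seq[col] in AMINO_ACIDS)  # gaps are ignored
        total = sum(counts.values())
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount * background) / (total + pseudocount)
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score(pssm, segment):
    """Sum of column scores for a segment as long as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


pssm = build_pssm(["TVGYG", "TVGYG", "TIGFG"])
print(round(score(pssm, "TVGYG"), 2), round(score(pssm, "KKKKK"), 2))
```

The same residue scores differently at different columns, which a single substitution matrix cannot do. A profile HMM adds explicit insert and delete states on top of these per-column probabilities.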

Depending on what exactly you want, sometimes simple/straightforward is a much better answer. Once again, keep light-hearted throughout the course. You don't have to rank them one by one for a perfect answer-sheet. If you can, just navigate a bit with free spirit (sorry, I guess off the track again), then see if you can develop your own feelings about their individualities/connections.

I think that there are already too many tools around, and with the current trend, there will only be more. I think what matters is that you stay afloat, and do not get overwhelmed. You don't have to look smart on every occasion whenever a new fancy tool pops up on the web/in pub-med. Instead, find the flow, the main stream, and read a bit, and continue. Am I talking nonsense? Most likely.. I guess I was hoping that you might find bioinformatics rather likeable. Have I failed you? My deepest apologies.

input and examples for HMMer

*R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological sequence analysis: probabilistic models of proteins and nucleic acids, Cambridge University Press, 1998.

2. PFAM

PFAM - a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families. [http://www.sanger.ac.uk/Software/Pfam/]

When you wish to find out what your protein might be doing, before conducting any experimental work, you would be interested in locating any motif/pattern in your sequence that is already well reported/established in the literature. People have been building many dedicated protein databases over the years. The databases differ in which protein families they concentrate on, and in whether they are more interested in smaller patterns, such as an enzyme active site or a binding motif, or in something more spread out, for instance domain recognition.

Then there came a point when experts in the field made an orchestrated effort to bring the individual tools under one roof, namely InterPro. Under there, you will find slightly more than a dozen such programmes listed. All you do is copy/paste a sequence, and InterPro will search those databases automatically for you. Voila, you get a consensus annotation!

You will see that far more than half of the tools are somehow linked to profile Hidden Markov Models (pHMMs). After all, once you have a tailor-made database of pHMMs representing the proteins you work with, all you need is one command, 'hmmpfam', within the HMMer package. That is where PFAM comes in strong in this field. Somebody did all the work: finding homologues for your protein family, making a seed alignment, building a pHMM ('hmmbuild'), calibrating it ('hmmcalibrate'), finding more homologues using that pHMM, much as we use psi-blast for a finer hunt ('hmmsearch'), and finally building a large multiple sequence alignment representing the complete family ('hmmalign'). They did this for nearly 10K families (pfamA and pfamB). They deserve the credit.

When we sit/work in the lab, most of us focus on a limited number of protein families. Therefore, learning how to work with HMMer gives you a lot of freedom and power, should your work involve a large set of sequences. For this session, we will only try out a couple of commands on a web-based HMMer, provided by the Pasteur Institute, so that you get to know what goes in and what comes out, nothing too deep.
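At the 'smaller pattern' end of that spectrum, a motif search can be as simple as a regular expression. The pattern below, T-[VI]-G-[YF]-G, is only an illustrative guess at the potassium channel signature that you can spot by eye in several of our input sequences; it is not taken from any curated database.

```python
import re

# Illustrative pattern (an assumption for this sketch, not a database entry).
SIGNATURE = re.compile(r"T[VI]G[YF]G")


def find_motif(sequences, pattern=SIGNATURE):
    """Map each sequence name to the 0-based start positions of the motif."""
    return {name: [m.start() for m in pattern.finditer(seq.upper())]
            for name, seq in sequences.items()}


fragments = {
    "KcsA": "SVETATTVGYGDLYPV",
    "Kir6.2Rat": "VQVTIGFGGRM",
    "CNG1Human": "LTLTTIGETPPPV",
}
print(find_motif(fragments))
```

A rigid pattern like this either matches or it does not, which is why it misses the CNG fragment here. Profile-based methods give graded scores instead, so diverged family members can still be recognised.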
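The chain of commands named above can be written down as follows. This sketch only builds and prints the command lines; it does not run anything. The argument order reflects my recollection of the HMMer 2 command-line usage, and all file names are hypothetical, so please check each programme's '-h' output before relying on it.

```python
import shlex


def hmmer2_pipeline(seed_alignment, family, database, hits_fasta, query_fasta):
    """Return the HMMer 2 style commands, in order, as argument lists.

    All file names are placeholders. Extracting the hit sequences from the
    hmmsearch output into hits_fasta is a separate step not shown here.
    """
    hmm = f"{family}.hmm"
    return [
        ["hmmbuild", hmm, seed_alignment],          # seed alignment -> pHMM
        ["hmmcalibrate", hmm],                      # calibrate E-value statistics
        ["hmmsearch", hmm, database],               # find more homologues
        ["hmmalign", "-o", f"{family}_full.aln", hmm, hits_fasta],  # full alignment
        ["hmmpfam", hmm, query_fasta],              # scan a query against the pHMM(s)
    ]


for command in hmmer2_pipeline("seed.aln", "kchannel", "db.fasta",
                               "kchannel_hits.fasta", "query.fasta"):
    print(shlex.join(command))
```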

*Pfam: clans, web tools and services. Robert D. Finn, Jaina Mistry, Benjamin Schuster-Böckler, Sam Griffiths-Jones, Volker Hollich, Timo Lassmann, Simon Moxon, Mhairi Marshall, Ajay Khanna, Richard Durbin, Sean R. Eddy, Erik L. L. Sonnhammer and Alex Bateman. Nucleic Acids Research (2006) Database Issue 34:D247-D251

IV. Phylogenetic analysis : 11:45am - 12:00pm + 5'

1. Calculating trees

a) PHYLIP (Almost known as the default program for generating phylogenetic trees.)

Phylip 1 for general info [http://evolution.genetics.washington.edu/phylip.html]

Phylip 2 for one-off calculation [http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html]

Once again, the Pasteur Institute provides a good web-interface for running phylip, and they present an English version for almost every section of their web site, otherwise my French would have failed me catastrophically.
You need to pay attention to what type of input is required for each programme. When you run it on a linux box, a couple of mandatory files must be present in your working directory for a successful execution. BUT, you needn't worry here.

I explained, during the lecture, the basic principles involved in how a multiple sequence alignment can be transformed into a distance matrix, and further onto tree topologies and branch lengths. I wholeheartedly acknowledge that I am not at all an expert in this area; there are many statistical concepts to be fully appreciated. In reality, I can still go about calculating trees and telling which trees are better than others, without quite knowing what I am muttering about.. sad.. For what it's worth, I did read the original paper on the neighbour-joining method, more than a couple of times in the past. Pathetic it sounds, I know.. I promise that I will open my mouth for what I, at least, think I know, and also that, otherwise, my lips will remain sealed during the session. Oops.

1) Your alignment should be reformatted from *.aln to *.phy (PHYLIP format) for 'protdist' in PHYLIP. You will then get an output of a distance matrix (symmetric, with zeros on the diagonal), derived from your alignment. You can see why people make a fuss about alignment: basically, everything that follows rests on it.
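As a rough picture of the step from alignment to distance matrix, the sketch below computes the plain fraction of differing residues (the so-called p-distance) for every pair of aligned sequences. This is a deliberate simplification: 'protdist' applies an amino-acid substitution model to correct the distances, which this example does not attempt.

```python
def p_distance(a, b):
    """Fraction of differing residues over columns where neither has a gap."""
    compared = differences = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differences += 1
    return differences / compared if compared else 0.0


def distance_matrix(aligned):
    """Return (names, square matrix) for a dict of aligned sequences."""
    names = list(aligned)
    matrix = [[p_distance(aligned[r], aligned[c]) for c in names]
              for r in names]
    return names, matrix


names, matrix = distance_matrix({"s1": "TVGYGD", "s2": "TVGFGD", "s3": "T-GFGE"})
for name, row in zip(names, matrix):
    print(name.ljust(10), " ".join(f"{value:.3f}" for value in row))
```

Shift one gap in the alignment and the numbers in this matrix change, and with them the tree.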

2) That distance-matrix now becomes your new input for calculating trees via a) Neighbor-Joining (NJ) and UPGMA methods, b) Fitch-Margoliash and least-squares methods, and c) Fitch-Margoliash and least-squares methods with molecular clock. NJ method as in a) is often the basis of the progressive alignments that we learnt in session II. Multiple Sequence Alignment.
This means that those alignment tools use a tree built by NJ method as scaffold, to stack up bricks (pair-wise alignments), to complete its rough construction (a final multiple sequence alignment).
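One joining step of the neighbour-joining method can be sketched as follows: pick the pair with the smallest Q value, compute the two branch lengths, and replace the pair with a new internal node. This is a textbook-style illustration on a small made-up matrix, not the code of 'neighbor' in PHYLIP or of any alignment tool.

```python
def nj_step(names, d):
    """Perform one neighbour-joining step on a square distance matrix.

    Returns (new_names, new_matrix); the joined pair becomes the first
    entry, written as a Newick-like string. Needs at least 3 taxa.
    """
    n = len(names)
    totals = [sum(row) for row in d]
    # Q(i, j) = (n - 2) * d(i, j) - total(i) - total(j); take the minimum.
    _, i, j = min(((n - 2) * d[i][j] - totals[i] - totals[j], i, j)
                  for i in range(n) for j in range(i + 1, n))
    # Branch lengths from the new node to the two joined taxa.
    length_i = 0.5 * d[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
    length_j = d[i][j] - length_i
    node = f"({names[i]}:{length_i:g},{names[j]}:{length_j:g})"
    # Distances from the new node to every remaining taxon.
    rest = [k for k in range(n) if k not in (i, j)]
    to_rest = [0.5 * (d[i][k] + d[j][k] - d[i][j]) for k in rest]
    new_matrix = [[0.0] + to_rest]
    for position, k in enumerate(rest):
        new_matrix.append([to_rest[position]] + [d[k][m] for m in rest])
    return [node] + [names[k] for k in rest], new_matrix


names = ["a", "b", "c", "d", "e"]
matrix = [[0, 5, 9, 9, 8],
          [5, 0, 10, 10, 9],
          [9, 10, 0, 8, 7],
          [9, 10, 8, 0, 3],
          [8, 9, 7, 3, 0]]
print(nj_step(names, matrix))
```

Repeating the step until only two nodes remain gives the full tree. Each step is a direct calculation on the matrix rather than a search over alternative topologies.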

We will discuss why 'neighbour' (NJ method) runs faster than 'fitch' and 'kitsch' (Fitch-Margoliash and least-squares methods) and also why the former tends to produce more edged trees than the other two methods.

3) Now you can try 'drawtree' or 'drawgram' to see a graphical version of your tree from the mathematical version of trees that you saw in the output produced during step 2). You have seen examples, nearing the end of the lecture earlier.

b) Clustalw (The classic alignment tool)

- given sequences, it can take you to trees via alignment, all within the suite; easy and can be good, not for your final result with utter confidence. http://www.ebi.ac.uk/clustalw/
I only put this down here to show that there is more to clustalw than near automatic alignment. This had better be done on a linux/unix box, and I couldn't find any web-server letting me do all I do with clustalw locally. Phylip can, instead, cover almost all for this part. So don't worry, please.

2. Viewing/manipulating trees

This section is incorporated here, just to let you know what is available out there. The following three have been tried and tested with real data. The latter two can be run on a PC/Linux box. I won't try to handle them with you on the web, this morning. (had better be run locally.) Please do not hesitate to get in touch with me, should you get interested in 'TreeDyn', quite nice indeed.

a) PhyloTree (Colouring trees, nearly automatically, according to taxonomy.)

PhyloTree [http://www.ogic.ca/projects/phyloview/]
We may try this one out, only for viewing purposes. You need the original tree files with an extension *.ph or equivalent, not the graphical trees drawn by PHYLIP. Those are ps (PostScript) files, and you will need to open them accordingly.

b) TreeView

Treeview [http://taxonomy.zoology.gla.ac.uk/rod/treeview.html] : This hyperlink isn't working, as of 23rd Apr. 2007.

c) TreeDyn (Graphical dynamics of handling a single large tree/multiple trees)

Treedyn [http://www.treedyn.org]

V. Protein secondary structure prediction : 12:05pm - 12:20pm + 5'

1. Transmembrane prediction server

Noj's script [http://sbcb.bioch.ox.ac.uk/TM_noj/TM_noj.html]

A consensus prediction for transmembrane helices. Dr. Jonathan Cuthbertson, at the Structural Bioinformatics and Computational Biochemistry Unit in Biochemistry at Oxford, developed the package. As he notes, none of the individual prediction tools was originally designed by him.

It is, however, very useful, since it nicely brings those components together, in a user-friendly environment. Similar examples can be taken from Jpred or InterPro. You have been introduced to a typical output of this server, during the lecture, and we will visit some of those individual prediction programmes together.
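The idea of a consensus can be sketched as a per-residue vote. How the server actually combines its component predictions is not described here, so the simple majority rule and the 'H'/'-' string encoding below are assumptions made for illustration only.

```python
def consensus_tm(predictions, min_votes=None):
    """Per-residue vote over equal-length prediction strings.

    'H' marks a residue predicted to be in a transmembrane helix and any
    other character marks a residue outside one. By default, a residue is
    called 'H' when more than half of the predictors agree.
    """
    if min_votes is None:
        min_votes = len(predictions) // 2 + 1
    consensus = []
    for position in range(len(predictions[0])):
        votes = sum(p[position] == "H" for p in predictions)
        consensus.append("H" if votes >= min_votes else "-")
    return "".join(consensus)


print(consensus_tm(["--HHHH--",
                    "-HHHHH--",
                    "--HHH---"]))
```

Raising min_votes makes the consensus stricter: helix ends, where predictors typically disagree, get trimmed, while the core on which they all agree remains.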

In particular,
TMHMM2.0: a predictor of transmembrane helices in proteins based on hidden Markov models, very fast, being based on only single sequence information.
*A. Krogh, B. Larsson, G. von Heijne, and E. L. L. Sonnhammer Journal of Molecular Biology, 305(3):567-580, January 2001
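For contrast with the HMM approach, the classic single-sequence alternative is a sliding-window hydropathy average. The sketch below uses the Kyte-Doolittle scale with a window of 19 residues and a cut-off of 1.6; it is a different and much cruder technique than TMHMM, shown only to make 'single sequence information' tangible.

```python
# Kyte-Doolittle hydropathy values.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
      "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
      "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
      "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every window along the sequence."""
    values = [KD[aa] for aa in seq.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """0-based start positions of windows at or above the threshold."""
    return [start for start, mean in enumerate(hydropathy_profile(seq, window))
            if mean >= threshold]


kcsa = ("MPPMLSGLLARLVKLLLGRHGSALHWRAAGAATVLLVIVLLAGSYLAVLAERGAPGAQLI"
        "TYPRALWWSVETATTVGYGDLYPVTLWGRLVAVVVMVAGITSFGLVTAALATWFVGREQE")
print(candidate_tm_windows(kcsa))
```

Runs of consecutive start positions point to hydrophobic stretches long enough to span a membrane. Unlike an HMM, this gives no model of loops, helix caps, or inside/outside topology.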

2. Pongo - a web server for multiple predictions of all-alpha transmembrane proteins.

Pongo [http://pongo.biocomp.unibo.it/pongo/]
This tool is part of an effort towards 'all-alpha transmembrane proteins annotation' by the Biosapiens Network, a European Virtual Institute for Genome Annotation.

It is another consensus prediction of transmembrane helices, housing 6 component programmes. They rely mainly on hidden Markov models, rather heavily, and on neural network analysis. According to their results on the Human and E.coli genomes, Pongo brought the estimated percentage of membrane proteins down to around a quarter or lower, compared with the widely reported earlier figure of over 30%.

Its output is rather more pleasant/graphical, and users can examine sequences through a window, so that you get an overall picture fast/clear and then dissect region by region. You may find some queues these days (it used to be readily available most of the time).

Contact information

Dr. Hyunji Kim
Department of Biochemistry
University of Oxford
South Parks Rd
Oxford
OX1 3QU
Great Britain

Tel: 01865 275380

Fax: 01865 275259

email: hyunji.kim@bioch.ox.ac.uk

last updated 23/04/07