a) NCBI-BLAST [http://www.ncbi.nlm.nih.gov/BLAST/]
In this practical we will focus mainly on protein sequences and protein databases.
Therefore, we try out 'blastp' first and then PSI-BLAST (Position-Specific Iterated BLAST).
The two BLAST programmes are very widely used and are run in much the same way. There is, however, a clear difference in how they work internally, which is often hidden from web-based users. While we try them out together, I will explain the difference.
b) WU-BLAST [http://www.ebi.ac.uk/blast2/]
This version of BLAST was developed and is maintained by
Washington University, USA.
NCBI-BLAST and WU-BLAST are quite similar in terms of the main algorithms involved, but they do show some minor differences: the default settings (e.g. whether the filtering option is on or off), the choice of substitution matrices you get (when you use the web-based versions), and how you construct the basic command line (should you prefer running them on a Linux box).
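Returning to the 'blastp vs PSI-BLAST' difference mentioned under a): blastp scores every position with the same substitution matrix, whereas PSI-BLAST builds a position-specific scoring matrix (PSSM) from the hits of the previous round and searches again with it. The sketch below is only a toy illustration of that idea, not the real PSI-BLAST procedure: the function names `build_pssm` and `score` are hypothetical, the background frequencies are assumed uniform, and a flat pseudocount replaces the sequence weighting and matrix-based pseudocounts that the real programme uses.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_hits, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log-odds score.

    aligned_hits: equal-length strings (gaps as '-'), e.g. the hits of one
    search round aligned to the query. Toy assumptions: uniform background
    frequencies and a flat pseudocount for every residue.
    """
    length = len(aligned_hits[0])
    if any(len(seq) != length for seq in aligned_hits):
        raise ValueError("all aligned hits must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        # Count residues in this column, skipping gaps and unknown symbols.
        counts = Counter(seq[col] for seq in aligned_hits if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Score an ungapped segment against the profile, column by column."""
    return sum(column[aa] for column, aa in zip(pssm, segment))
```

With hits such as `["ACDW", "ACEW", "ACDW"]`, a conserved column rewards its conserved residue and penalises everything else, while a variable column is more tolerant. That is why a profile can pick up distant relatives that a single fixed matrix misses.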
Fasta-Protein Similarity Search [http://www.ebi.ac.uk/fasta33/]
Another heuristic approach, like BLAST.
This is just to let you know that there is something other than BLAST, within easy reach.
MPsrch - Protein Database Query [http://www.ebi.ac.uk/MPsrch]
It may not have been as popular as the above two, but it has certainly been around for some time.
Rather than relying on heuristic approaches for the sake of speed, MPsrch sticks to the original local alignment algorithm, the Smith-Waterman algorithm, for maximum sensitivity. Because this involves heavy dynamic programming, it tackles the matter of 'speed' by making full use of parallel computing, as far as I have learnt.
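To give a feel for the dynamic programming mentioned above, here is a minimal sketch of Smith-Waterman local alignment. It is an illustration under simplifying assumptions, not MPsrch's implementation: it uses a flat match/mismatch score and a linear gap penalty (the values are arbitrary choices) instead of a substitution matrix with affine gaps, and the function name `smith_waterman` is hypothetical.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Simplified scoring: identical residues score `match`, different residues
    score `mismatch`, and each gap position costs `gap` (linear gap penalty).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # start a fresh alignment here
                          H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,    # gap in b
                          H[i][j - 1] + gap)    # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

Every cell of the table has to be filled, so the work grows with the product of the two sequence lengths. Against a whole database that is exactly the cost the heuristic programmes avoid, and the cost that MPsrch is said to spread over parallel hardware.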
*************************************************************************************
You can copy/paste it into the input box provided by all three of the above programmes. Try the default settings first; we will then discuss when to choose which substitution matrix and how different outputs can result from that choice.
This part may not run as straightforwardly as it should, since we are restricted to web-only sessions, but we can still try to get a taste of it.
T-COFFEE Tree-based Consistency Objective Function For alignment Evaluation
[http://www.ebi.ac.uk/t-coffee/]
Another one of those progressive methods for building a multiple
sequence alignment. It tries, however, to overcome a weakness seen in ClustalW by introducing
an intermediate stage between the complete set of all possible pair-wise alignments and the final
multiple sequence alignment. This is somewhat like introducing restraints of your
own to enhance (or steer a little) a homology model of your protein.
*T-Coffee: A novel method for multiple sequence alignments.
C. Notredame, D. Higgins, J. Heringa,
Journal of Molecular Biology, Vol 302, pp 205-217, 2000
Muscle MUltiple Sequence Comparison by Log-Expectation, [http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py/]
Muscle is said to demonstrate improved speed and accuracy in comparison to ClustalW and T-Coffee.
Once you open the box, what you see is still a progressive approach to multiple alignment. Shall
we call it iterated progressive alignment, which would explain why it should out-perform ClustalW? You
may well disagree, but to me it looks somewhere near 'blastp vs PSI-BLAST'. I respect the author/developer,
in particular for how he still appreciated and acknowledged ClustalW in his original papers while he was
clearly presenting a new method that is supposed to do far better.
Each method has its own merits, and it is up to you to decide how much you would like to learn about them,
I mean really learn about them. Sometimes an old and humble sword can shine in the hands of a true
master (off the track, OK...). I still cannot forget how Toby Gibson looked, sitting in his tiny office
far down the corridor at the EMBL, simply staring at a screenful of sequences presented
in ClustalW/X style. He was right there, lost in his own world, communicating with them. Since then, I just
cannot dismiss ClustalW, that old, all-too-familiar gadget.
The point is that I am undoubtedly bringing in my own subjective
opinion, so take what you want from this session and make up your own mind.
*Robert C Edgar
MUSCLE: a multiple sequence alignment method with reduced time and space complexity
BMC Bioinformatics 2004, 5:113.
*Edgar RC. (2004)
MUSCLE: multiple sequence alignment with high accuracy and high throughput
Nucleic Acids Res. 2004 Mar 19;32(5):1792-7.
a) Jalview - a java multiple alignment editor [http://www.jalview.org]
b) Cinema - Colour INteractive Editor [http://www.biochem.ucl.ac.uk/bsm/dbbrowser/CINEMA2.02/index2.html]
*Parry-Smith, D.J., Payne, A.W.R., Michie, A.D. and Attwood, T.K. (1997) CINEMA - A novel Colour INteractive Editor for Multiple Alignments. Gene, 211(2), GC45-56.
**Your input for alignment**
HMMer 2 for one-off calculation [http://bioweb.pasteur.fr/seqanal/motif/hmmer-uk.html]
Depending on exactly what you want, simple and straightforward is sometimes a much better answer. Once again, stay light-hearted throughout the course. You don't have to rank the tools one by one for a perfect answer sheet. If you can, just navigate a bit with a free spirit (sorry, I guess I am off the track again), then see if you can develop your own feeling for their individualities and connections.
input and examples for HMMer
*Pfam: clans, web tools and services. Robert D. Finn, Jaina Mistry, Benjamin Schuster-Böckler, Sam Griffiths-Jones, Volker Hollich, Timo Lassmann, Simon Moxon, Mhairi Marshall, Ajay Khanna, Richard Durbin, Sean R. Eddy, Erik L. L. Sonnhammer and Alex Bateman. Nucleic Acids Research (2006) Database Issue 34:D247-D251
Phylip 2
for one-off calculation [http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html]
During the lecture I explained the basic principles of how a multiple sequence alignment can be transformed
into a distance matrix, and from there into tree topologies and branch lengths. I wholeheartedly
acknowledge that I am not at all an expert in this area; there are many statistical concepts to be
fully appreciated. In reality, I can still go about calculating trees and telling which trees are better
than others without quite knowing what I am muttering about... sad... For what it's worth, I did read
the original paper on the neighbour-joining method more than a couple of times in the past. Pathetic
as it sounds, I know... I promise that I will open my mouth about what I at least think I know, and
that otherwise my lips will remain sealed during the session. Oops.
1) Your alignment should be reformatted from *.aln to *.phy (PHYLIP format) for 'protdist' in PHYLIP.
You will then get as output a pairwise distance matrix derived from your alignment. You can
see why people make a fuss about alignment: basically, it is used all over the place.
2) That distance matrix now becomes your new input for calculating trees via a) Neighbor-Joining (NJ)
and UPGMA methods, b) Fitch-Margoliash and least-squares methods, and c) Fitch-Margoliash and
least-squares methods with a molecular clock. The NJ method in a) is often the basis of the progressive
alignments that we learnt about in session II, Multiple Sequence Alignment.
This means that those alignment
tools use a tree built by the NJ method as scaffolding, on which to stack up bricks (pair-wise alignments) and complete
the rough construction (a final multiple sequence alignment).
We will discuss why 'neighbor' (NJ method) runs faster than 'fitch' and 'kitsch' (Fitch-Margoliash
and least-squares methods) and also why the former tends to produce more edged trees than
the other two methods.
3) Now you can try 'drawtree' or 'drawgram' to see a graphical version of your tree, drawn from the mathematical description of the tree that you saw in the output of step 2). You saw examples of this near the end of the lecture earlier.
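The route from alignment to distance matrix to tree can be sketched in a few lines. This is a toy under stated assumptions, not what PHYLIP does internally: the distance here is the plain fraction of differing positions (p-distance) rather than the evolutionary models used by 'protdist', the clustering is UPGMA rather than neighbour-joining, and the function names are hypothetical. The result is a parenthesised (Newick-style) string, the kind of text description of a tree that a drawing programme turns into a picture.

```python
def p_distance(s1, s2):
    """Fraction of differing positions, ignoring columns with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(s1, s2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(alignment):
    """alignment: dict of name -> aligned sequence. Returns {(name1, name2): distance}."""
    names = list(alignment)
    return {(a, b): p_distance(alignment[a], alignment[b])
            for a in names for b in names if a != b}


def upgma(alignment):
    """Build a UPGMA tree and return it as a Newick string with branch lengths."""
    dist = distance_matrix(alignment)
    # Each cluster is keyed by its Newick text and stores (number of leaves, height).
    clusters = {name: (1, 0.0) for name in alignment}
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(((x, y) for x in clusters for y in clusters if x < y),
                   key=lambda pair: dist[pair])
        (size_a, height_a), (size_b, height_b) = clusters.pop(a), clusters.pop(b)
        height = dist[(a, b)] / 2  # the new node sits halfway between the two
        merged = f"({a}:{height - height_a:.3f},{b}:{height - height_b:.3f})"
        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances.
        for c in clusters:
            d = (size_a * dist[(a, c)] + size_b * dist[(b, c)]) / (size_a + size_b)
            dist[(merged, c)] = dist[(c, merged)] = d
        clusters[merged] = (size_a + size_b, height)
    return next(iter(clusters)) + ";"
```

For example, `upgma({"A": "ACDEFGHIKL", "B": "ACDEFGHIKV", "C": "ACDEWWWWWW"})` joins A and B first, because they differ at only one position, and attaches C last. Notice that the tree depends entirely on which columns the alignment puts side by side, which is one more reason people make a fuss about alignment.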
Noj's script [http://sbcb.bioch.ox.ac.uk/TM_noj/TM_noj.html]
A consensus prediction for transmembrane helices. Dr. Jonathan Cuthbertson, at the Structural Bioinformatics and Computational Biochemistry Unit in Biochemistry at Oxford, developed the package. As he notes, none of the individual prediction tools was originally designed by him. It is, however, very useful, since it brings those components together nicely in a user-friendly environment. Similar examples are Jpred and InterPro. You were introduced to a typical output of this server during the lecture, and we will visit some of the individual prediction programmes together.
In particular,
TMHMM2.0: a predictor of transmembrane helices in proteins based on hidden
Markov models; very fast, since it uses only single-sequence
information.
*A. Krogh, B. Larsson, G. von Heijne, and E. L. L. Sonnhammer
Journal of Molecular Biology, 305(3):567-580, January 2001
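The idea of a consensus prediction, as described above for Noj's script, can be sketched as a simple majority vote over per-residue predictions. This is only an assumption about how such a consensus might be formed, not the actual method of any server named here: the input format (one string per predictor, 'M' for a membrane residue), the voting threshold, the minimum helix length and the function names are all hypothetical.

```python
def consensus_topology(predictions, min_votes=None):
    """Majority vote over per-residue predictions.

    predictions: equal-length strings, one per predictor, where 'M' marks a
    residue predicted to be in a transmembrane helix. Returns a string of
    'M' and '-'. By default a residue needs more than half of the votes.
    """
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must cover the same sequence length")
    if min_votes is None:
        min_votes = len(predictions) // 2 + 1
    return "".join(
        "M" if sum(p[i] == "M" for p in predictions) >= min_votes else "-"
        for i in range(length)
    )


def helix_segments(consensus, min_length=15):
    """Return 1-based (start, end) positions of 'M' runs of at least min_length."""
    segments, start = [], None
    # The trailing '-' closes a run that reaches the end of the sequence.
    for i, state in enumerate(consensus + "-"):
        if state == "M" and start is None:
            start = i
        elif state != "M" and start is not None:
            if i - start >= min_length:
                segments.append((start + 1, i))
            start = None
    return segments
```

Feeding in the outputs of three predictors that disagree slightly about where a helix starts and ends gives a segment covering only the residues that at least two of them agree on. That is the appeal of consensus servers: the ends of helices are where individual tools differ most.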
Pongo [http://pongo.biocomp.unibo.it/pongo/]
This tool is part of an 'all-alpha transmembrane proteins annotation' effort by
the BioSapiens Network, a European Virtual Institute for Genome Annotation.
It is another consensus predictor of transmembrane helices, housing 6 component programmes. These rely heavily on hidden Markov models, together with neural network analysis. According to their results for the human and E. coli genomes, Pongo brought the estimated percentage of membrane proteins down to around a quarter or lower, in contrast to the figure of over 30% widely reported previously.
Its output is rather more pleasant and graphical, and users can examine sequences through a window, so that you get an overall picture quickly and clearly and can then dissect it region by region. You may find some queues these days (it used to be available most of the time).
Dr. Hyunji Kim
Tel: 01865 275380
Fax: 01865 275259
email: hyunji.kim@bioch.ox.ac.uk