Phylogenetic Analysis Assignment
Introduction
Molecular evolution is the study of how proteins and nucleic acids evolve. Included in this field are studies of mutations and chromosomal rearrangements, the evolutionary process, the identification of sequence patterns conferring function in proteins and nucleic acids, and the reconstruction of the evolutionary history of organisms and the molecules that they make. All of these studies rely on comparisons of nucleotide or amino acid sequences.
In this tutorial, you will be introduced to some of the fundamental principles of molecular evolution and the types of bioinformatics tools that are used in evolutionary studies. We will begin with a manual sequence comparison to introduce the basic concepts. The remainder of the project will be carried out at The Biology Workbench, a set of bioinformatics analysis programs managed by the San Diego Supercomputer Center at the University of California, San Diego.
Objectives
• To introduce the principles of molecular evolution
• To acquaint you with the tools that are available to compare nucleotide and amino acid sequences
• To learn about the use of protein sequences in reconstructions of evolutionary history
Project
Branching evolution occurs when one ancestral species gives rise to two or more progeny species. However, speciation events don’t involve the vast majority of the genes in a genome. That is, for most genes, both of the progeny species inherit identical genes from the ancestor. Following speciation, these genes evolve independently in the separate lineages. Studies of molecular evolution therefore rely heavily on comparisons of related sequences from different organisms.
Shown below is an alignment of two homologous sequences that we will use as a starting place. Homologous sequences are sequences that have descended from a common ancestral sequence. You can’t meaningfully compare sequences unless they are homologous. This alignment uses the single letter amino acid code, in which G represents glycine, Q represents glutamine, etc. The aligned proteins have been shown to be involved in the metabolism of similar, but different, toxic compounds. As you can see, these amino acid sequences are very similar and it is easy to recognize that they are related by common descent.
dntAc: KMGVDDEVIVSRQNDGSVR
nahAc: KMGIDDEVIVSRQSDGSIR
An expanded version of this alignment is shown below. In this expanded alignment, both the amino acids and the corresponding DNA nucleotides are shown. For ease of analysis, the codons have been broken into separate entries in a table.
Alignment of nahAc and dntAc sequences.
|       | K   | M   | G   | V   | D   | E   | V   | I   | V   |
| dntAc | AAA | ATG | GGC | GTC | GAT | GAA | GTC | ATC | GTC |
| nahAc | AAA | ATG | GGT | ATT | GAC | GAG | GTC | ATC | GTC |
|       | K   | M   | G   | I   | D   | E   | V   | I   | V   |

|       | S   | R   | Q   | N   | D   | G   | S   | V   | R   |
| dntAc | TCC | CGC | CAG | AAC | GAT | GGC | TCG | GTG | CGA |
| nahAc | TCT | CGG | CAG | AGC | GAC | GGT | TCG | ATT | CGT |
|       | S   | R   | Q   | S   | D   | G   | S   | I   | R   |
This region was chosen at random to represent the changes that take place in nucleotide sequences over time.
Answer the questions below by manually comparing these sequences. (This section is for your own understanding; you do not need to turn it in.)
1. Assuming that the dntAc sequence represents the ancestral sequence, how many nucleotide changes (mutations) have occurred in this region to create the nahAc nucleotide sequence? Remember that in actuality neither sequence represents the ancestral sequence.
2. Of these nucleotide changes, how many changed the amino acid encoded by that codon (i.e., how many were nonsynonymous changes)?
3. How many nucleotide changes were in the first codon position? How many of these altered the encoded amino acid?
4. How many nucleotide changes were in the second codon position? How many of these altered the encoded amino acid?
5. How many of the nucleotide changes were in the third codon position? How many of these altered the encoded amino acid?
6. Compare the % identity of these two sequences at the nucleotide vs protein level. Percent identity is equal to (# of positions in common / total # positions) * 100.
Nucleotide % identity
Amino acid % identity
7. Why is there a difference between amino acid % identity and nucleotide percent identity?
If needed, a table of the single letter amino acid code can be found at: http://umber.sbs.man.ac.uk/dbbrowser/bioactivity/aacodefrm.html
If needed, a codon table can be found at: http://www.pangloss.com/seidel/Protocols/codon.html
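The counting in questions 3 through 6 can also be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the two sequences are assumed to be already aligned and of equal length, and the toy sequences below are made up so that the exercise itself is left for you to do by hand.

```python
def percent_identity(seq_a, seq_b):
    """Percent identity = (# of positions in common / total # positions) * 100."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def changes_by_codon_position(dna_a, dna_b):
    """Count nucleotide differences at codon positions 1, 2 and 3.

    Assumes both sequences start at the first base of a codon.
    """
    if len(dna_a) != len(dna_b):
        raise ValueError("sequences must be aligned to the same length")
    counts = {1: 0, 2: 0, 3: 0}
    for index, (x, y) in enumerate(zip(dna_a, dna_b)):
        if x != y:
            counts[index % 3 + 1] += 1
    return counts


# Toy data (NOT the dntAc/nahAc sequences): three codons each.
toy_a = "ATGGCCAAA"  # Met Ala Lys
toy_b = "ATGGCTAGA"  # Met Ala Arg

print(round(percent_identity(toy_a, toy_b), 1))
print(changes_by_codon_position(toy_a, toy_b))
```

The same percent_identity function works on amino acid strings, which makes it easy to compare the nucleotide-level and protein-level values for one pair of sequences.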
The manual analysis that you just carried out introduced you to some of the ways that molecules evolve. Its purpose was to get you thinking about the mechanisms by which genes and proteins change over time, and the types of forces that control those changes. For example, when we do analyses of this type we almost always see many more changes in the third position of codons than in the first position. Why is this? Do you think that these nucleotides mutate at a higher rate than nucleotides in the first position? What else might be responsible for this phenomenon?
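One way to explore the questions above is to ask the codon table itself how often a single-nucleotide change at each codon position leaves the amino acid unchanged. The sketch below assumes the standard genetic code and treats every possible substitution as equally likely, which is a simplification (real mutation patterns and codon usage are not uniform). The names CODON_TABLE and synonymous_fraction are illustrative, not part of any tool used in this tutorial.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
# "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def synonymous_fraction(position):
    """Fraction of single-base changes at a codon position (0, 1 or 2)
    that leave the encoded amino acid unchanged.

    Only sense codons are used as starting points; a change that creates
    a stop codon is counted as nonsynonymous.
    """
    total = 0
    synonymous = 0
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid == "*":
            continue
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            if CODON_TABLE[mutant] == amino_acid:
                synonymous += 1
    return synonymous / total


for pos in range(3):
    print(f"codon position {pos + 1}: {synonymous_fraction(pos):.2f}")
```

Comparing the three printed fractions with the counts from your manual analysis may help you decide whether a difference in mutation rate is needed to explain what you observed.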
Several computer tools have been developed that allow you to quickly retrieve, align, and compare genes and proteins from different organisms. The remaining portion of this tutorial will be carried out using The Biology Workbench, a set of bioinformatics analysis programs managed by the San Diego Supercomputer Center at the University of California, San Diego. The Biology Workbench integrates a wide variety of programs, and the site can be used for many kinds of analyses in addition to molecular evolution studies.
Assignment
You are going to retrieve, align, and compare hemoglobin protein sequences from a variety of animals. Print out the phylogenetic trees for both β-hemoglobin alone and α- and β-hemoglobin together. Don’t forget to write your name on the assignment. There are additional questions in the protocol that you should answer for your lab manuals only.
In addition, do a third, separate alignment of a gene of interest to you from 5 different species, then print out and include the phylogenetic tree.
Or
Extend the hemoglobin analysis for 5 additional species suggested at the end of the assignment, then print out and include the phylogenetic tree.
Project 1: β-hemoglobin gene alignment
Hemoglobin is a protein in red blood cells that is involved in oxygen transport. It belongs to a family of related oxygen-binding globin proteins whose genes have evolved through a number of speciation and gene duplication events.
In the instructions below, the actions that you will need to take on the internet site are indicated in boldface type.
Computer-Aided Sequence Comparisons
First, you will need to access The Biology Workbench at:
http://workbench.sdsc.edu/
In order to use the site, you need to set up a free account. To do this, select Register and follow the instructions.
You will be taken to a new session. You can carry out several different projects at the same time using The Biology Workbench, and it keeps your different projects in different folders, which are referred to as sessions.
Now, select Protein Tools to begin.
Retrieving protein sequences using The Biology Workbench.
You are going to retrieve, align, and compare hemoglobin proteins from a variety of different animals. You will do this using a tool named Ndjinn. Click on Protein Tools and then use the menu to select Ndjinn – Multiple Database search, then select Run. First, you need to select the database to use for the search. In the alphabetical list of databases, select the SWISSPROT database. SWISSPROT is a protein sequence database that was begun in 1986 and is maintained collaboratively by the Swiss Institute of Bioinformatics and the European Bioinformatics Institute. This database has recently been updated, and is now divided into divisions (human, rodent, mammal, vertebrate, etc.). Be sure to search the correct division for each organism.
Second, in the search box, enter the term beta hemoglobin in order to search for beta hemoglobin sequences. Change hits per page to 50. Then select Search. A list of sequences will be returned (lots and lots of hemoglobin sequences are in the databases, which is one of the reasons why we are using this protein). You want to select the box next to the entry labeled
SWISSPROT:HBB_HUMAN
Hemoglobin subunit beta (Hemoglobin beta chain) (Beta-globin) [Homo sapiens (Human)]
This result won’t be at the top of the list. Make sure that you selected the sequence for the beta chain, and not the alpha, delta or gamma sequences.
Now, select Import Sequence from the list of actions at the bottom of the page.
This will return you to the Protein tools screen, but now the entry you just imported will be underneath the action menu.
For this project, you also want to import 4 more hemoglobin beta sequences into your session – from gorilla (Gorilla gorilla), mouse (Mus musculus), chicken (Gallus gallus), and bullfrog (Lithobates catesbeiana). The specific sequences you want are labeled as follows:
SWISSPROT:HBB1_MOUSE
HEMOGLOBIN BETA-1 CHAIN (B1) (MAJOR) [Mus musculus (Mouse)]
SWISSPROT:HBB_CHICK
HEMOGLOBIN BETA CHAIN [Gallus gallus (Chicken)]
SWISSPROT:HBB_LITCT
HEMOGLOBIN BETA CHAIN [Lithobates catesbeiana (American bullfrog) (Rana catesbeiana)]
SWISSPROT:HBB_GORGO
(P02024) HEMOGLOBIN BETA CHAIN [Gorilla gorilla]
Note: to get these sequences quickly, you can use Ndjinn with the appropriate search terms. Even faster, you could change the settings for Ndjinn so that it displays all hits instead of 10, and then use the ‘search in page’ feature of your browser. Select the sequences, then select Import Sequence from the list of actions at the bottom of the page.
This will take you to a page similar to the one that you were at before, but now there should be five different hemoglobin sequences listed on the page. These organisms are related in a way that should be pretty clear to you. The human hemoglobin sequence is identical to that from chimpanzees. Which of the remaining four amino acid sequences do you think will be most like the human sequence? Which sequence will be the next closest? Which sequence do you think will be the most different from the human sequence? Why?
Aligning the sequences.
Now we will carry out an alignment of the five hemoglobin sequences that you have retrieved. We will use a program called ClustalW to perform the alignment. ClustalW is a multiple sequence alignment tool that searches out the best global alignment, that is, the alignment with the most identity across the entire range of all of the sequences. The program is adjustable in that you can give different weights to different types of mismatches.
Select the boxes of all five of the hemoglobin sequences. Then select CLUSTALW Multiple Sequence Alignment from the action menu, and select Run. This will take you to a new page. Change output tree from ‘unrooted tree’ to ‘rooted and unrooted trees’, and select Submit. This will take you to a new page. Scroll down the page to the section called ‘Sequence alignment’. Each of the amino acid sequences that you retrieved is shown in the alignment.
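To make the idea of a global alignment with adjustable weights more concrete, here is a minimal sketch of a pairwise global alignment by dynamic programming (the Needleman-Wunsch approach). It is not ClustalW: ClustalW aligns many sequences progressively and uses amino acid substitution matrices, whereas this sketch aligns only two sequences with a single match score, mismatch penalty, and gap penalty. The function name and the default scores are assumptions chosen for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best global alignment."""
    n, m = len(a), len(b)

    def pair_score(i, j):
        # Score for placing a[i-1] opposite b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align two residues
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = global_align("ACGT", "AGT")
print(total)
print(top)
print(bottom)
```

Changing the mismatch or gap values changes which alignment scores best, which is the sense in which an alignment program is adjustable.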
The alignment is color-coded, so that you can see different types of changes without looking at the specific amino acids in the sequence. Four different colors are used to indicate varying levels of sequence conservation. The bright blue areas are identical across all 5 sequences. The green and dark blue areas are conserved but not identical. This means that although different amino acids occupy this position, they are similar to one another; for example, an amino acid with a hydrophobic R group has been replaced by another amino acid that also has a hydrophobic R group.
The black regions are unconserved: different amino acids with different properties are present at this position in the different proteins.
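The distinction between identical, conserved, and unconserved positions can be sketched as a simple classification of alignment columns. The grouping of amino acids below is one rough, commonly used partition by side-chain properties; it is an assumption for illustration and is not the scheme the Workbench uses for its colors (which distinguishes four levels, while this sketch uses three). The names GROUPS, classify_column, and classify_alignment are hypothetical.

```python
# Rough side-chain property groups (an illustrative assumption).
GROUPS = {
    "hydrophobic": set("AVILMFWC"),
    "polar": set("STNQYGP"),
    "acidic": set("DE"),
    "basic": set("KRH"),
}


def group_of(residue):
    """Return the property group of a residue, or None for gaps/unknowns."""
    for name, members in GROUPS.items():
        if residue in members:
            return name
    return None


def classify_column(column):
    """Classify one alignment column as identical, conserved or unconserved."""
    residues = set(column)
    if len(residues) == 1 and "-" not in residues:
        return "identical"
    groups = {group_of(r) for r in residues}
    if len(groups) == 1 and None not in groups:
        return "conserved"
    return "unconserved"


def classify_alignment(aligned_sequences):
    """Classify every column of a list of equal-length aligned sequences."""
    return [classify_column(column) for column in zip(*aligned_sequences)]


# Toy alignment of three short, made-up sequences.
print(classify_alignment(["KMGV", "KMGI", "RMSD"]))
```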
One thing that you can see in this alignment is that there are some regions in the protein that are highly conserved, and others that are conserved to a lesser extent. This reflects the fact that some parts of a protein can and do evolve at different rates than other parts. The highly conserved regions are likely to be directly involved in the functioning of the protein, which in this case might be in binding the heme group or in interchain interactions. In an enzyme, amino acids located at the active site are likely to be highly conserved. The less conserved regions are unlikely to be directly involved in the functioning of the protein; for example, they may have a ‘spacer’ function to separate two other regions.
Mutations happen randomly, however, and occur without regard to the type of change in the encoded protein. When changes occur that produce a nonfunctional protein, the organisms that have that mutation are likely to be eliminated via natural selection. The selection pressure against organisms with nonfunctional or less-functional hemoglobin is likely to be high. The highly conserved regions aren’t conserved because mutations never occurred in these positions, but rather because natural selection eliminated those mutations once they did occur.
The Phylogenetic Tree created from the alignment
The ClustalW program also gives you another means of viewing the information in the alignment: it depicts this information in the form of a phylogenetic tree.
At the bottom of the page is a dendrogram that depicts the relationships between the sequences. In this figure, the sequences are represented at the tips of the lines. A branchpoint, where two lines diverge, represents a single ancestor sequence. Two sequences that have a single branchpoint between them are more closely related to each other than to the other sequences. The horizontal length of the branches represents evolutionary distance (related to amount of dissimilar sequence), so that two sequences with a high amount of similarity are connected by a smaller distance than two sequences that are less closely related.
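The link between sequence dissimilarity and tree shape can be illustrated with a small sketch. It computes a simple distance between each pair of aligned sequences (the fraction of positions that differ) and then repeatedly joins the two closest clusters, a method known as UPGMA. This is a deliberately simpler technique than the one ClustalW uses to draw its trees (typically neighbor joining), it reports only the branching order and not branch lengths, and the function names and toy sequences are assumptions for illustration.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma_topology(sequences):
    """Cluster aligned sequences by UPGMA; return the topology in Newick form.

    sequences: dict mapping a name to an aligned sequence.
    """
    sizes = {name: 1 for name in sequences}  # cluster label -> number of tips
    dist = {
        frozenset((a, b)): p_distance(sequences[a], sequences[b])
        for a, b in combinations(sequences, 2)
    }
    while len(sizes) > 1:
        # Join the two clusters separated by the smallest distance.
        pair = min(dist, key=dist.get)
        a, b = sorted(pair)
        merged = f"({a},{b})"
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        del dist[pair]
        # Distance from the new cluster to each remaining cluster is the
        # size-weighted average of the distances from its two parts.
        for other in sizes:
            d_a = dist.pop(frozenset((a, other)))
            d_b = dist.pop(frozenset((b, other)))
            dist[frozenset((merged, other))] = (
                (size_a * d_a + size_b * d_b) / (size_a + size_b)
            )
        sizes[merged] = size_a + size_b
    return next(iter(sizes)) + ";"


# Toy aligned sequences (made up, not hemoglobin).
toy = {
    "seqA": "ACGTACGT",
    "seqB": "ACGTACGA",
    "seqC": "ACGTTTTT",
    "seqD": "TTTTTTTT",
}
print(upgma_topology(toy))
```

In the printed result, names enclosed in the same pair of parentheses share a branchpoint, which corresponds to two tips joined by a single branchpoint in the dendrogram.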
This type of diagram can be used to infer the evolutionary relationships between the sequences, because you would expect that sequences which are nearly identical shared a recent common ancestor. Similarly, the evolutionary relationships between the source organisms can be inferred from the relationships between the sequences.
Here, you can see the relationship that you intuitively expected: the gorilla sequence is very similar to the human sequence, because these two organisms are both primates and therefore shared a common ancestor more recently than humans and mice did.
Similarly, the mouse, a mammal, shared a common ancestor with the primates more recently than it shared a common ancestor with the chicken or frog.
Different models for evolutionary mechanisms might produce different evolutionary distances than those shown here.
• Print out the phylogenetic tree (rooted or unrooted) and include it in your homework assignment
The relationship between alpha and beta globin sequences
Hemoglobin is a tetramer (4 subunits) of two different polypeptides named alpha and beta hemoglobin. Alpha and beta hemoglobin are related to each other via an ancient gene duplication event. That is, there was one gene (the ancestral globin), and it duplicated to form two genes in the same organism (alpha and beta globins), and then these genes underwent independent evolution as the progeny of that organism reproduced. This gene duplication event occurred before mammals diverged. Therefore, an alpha globin gene in a mouse is more closely related to (or, more recently diverged from) an alpha globin gene from a chimpanzee than it is to a beta globin gene from the same mouse. Gene duplication is a very important evolutionary process, and it is clear that it has happened numerous times on an evolutionary timescale.
The alpha and beta sequences arose in an ancestor common to all of the animals that we are looking at. That is, this gene duplication event occurred before the divergence of the lineages that led to modern primates, mammals, birds, and amphibians.
To import the alpha globin sequences into your session, press the gray Return button on the screen, then deselect the alignment, and then select Protein Tools. Then use Ndjinn to import the following sequences. If you carry out a search for ‘hemoglobin alpha’ in the SWISSPROT database using Ndjinn, you will have to search through several hundred responses, so it may help to use the ‘find in page’ feature of your internet browser.
SWISSPROT:HBA_HUMAN
HEMOGLOBIN ALPHA CHAIN [Homo sapiens (Human), Pan troglodytes (Chimpanzee), and Pan paniscus (Pygmy chimpanzee) (Bonobo)]
SWISSPROT:HBA_CHICK
HEMOGLOBIN ALPHA-A CHAIN [Gallus gallus (Chicken)]
SWISSPROT:HBA_MOUSE
HEMOGLOBIN ALPHA CHAIN [Mus musculus (Mouse)]
SWISSPROT:HBAB_LITCT
HEMOGLOBIN ALPHA-B CHAIN [Lithobates catesbeiana (American bull frog)]
SWISSPROT:HBA_GORGO
HEMOGLOBIN ALPHA CHAIN [Gorilla gorilla gorilla (Lowland gorilla)]
Once you have all the sequences imported, select all ten sequences (your 5 beta sequences and 5 alpha sequences) and perform another ClustalW alignment of the sequences. Make sure that the entire results page has loaded (this takes a little bit of time), and then examine the unrooted phylogenetic tree of the sequences.
• Print out the phylogenetic tree (dendrogram) of alpha and beta globin sequences from the five organisms. Include it in your assignment.
Answer these questions for your lab manuals:
1. Do the alpha sequences cluster together separately from the beta sequences?
2. Is the branching pattern among the alpha sequences similar to the branching among the beta sequences?
Independent project: You may do an independent project in which you align sequences of a gene you are interested in from 5 different species, or you may extend the beta hemoglobin gene analysis and complete the project idea below. Print the phylogenetic tree and include it in your assignment.
Project Idea
Estimating relationships between existing mammals.
In the analysis that we carried out, it was fairly easy to predict the results because we understand the relationships between humans, gorillas, mice, chickens, and frogs. The way in which other animals are related can be difficult to understand. Even among the mammals, it may be difficult to see how different organisms are related to each other. Using beta hemoglobin sequences, investigate relationships between the organisms below.
• Minke whale
• Killer whale
• Harbor seal
• Indian elephant
• White rhinoceros
• Brazilian manatee
• European river otter
• Polar bear
• Hippopotamus
Before carrying out the analysis, predict the relationships between the different organisms.
1. Are whales and seals more closely related to one another than to any of these other species?
2. To which organism is the manatee most closely related?
3. To which organism is the elephant most closely related?
4. To which organism is the otter most closely related?
After carrying out the analysis:
1. Are whales and seals more closely related to one another than to any of these other species?
2. To which organism is the manatee most closely related?
3. To which organism is the elephant most closely related?
4. To which organism is the otter most closely related?
Would you predict that you would see the same results if you used a different protein for the analysis, for example the cytochrome c protein? Why or why not?
The following review questions will help to reinforce what you’ve learned in this tutorial. Answer them for your lab manual.
Questions – Review
1. Why was it necessary to carry out an alignment of the sequences we analyzed?
2. Nucleotide substitutions might lead to changes in the sequence of amino acids in a protein, which can be seen by comparing homologous positions in a sequence alignment. How would deletions or insertions appear in a sequence alignment?
3. Sometimes gene duplications result in the formation of a pseudogene, a sequence of nucleotides that is very similar to a gene but isn’t expressed. This can occur in a variety of ways, for example if the gene is duplicated but a regulatory region or the promoter region for that gene is not duplicated. Which would evolve (accumulate nucleotide substitutions) more rapidly, a pseudogene or a duplicated gene that was expressed? Explain.