GRASP Ancestral Sequence Reconstruction

Reconstructing ancestors with GRASP

Data Downloads

Tutorial FASTA File: the data for the tutorial.
Curated Alignment and Newick Files: a curated tree and alignment ready to upload to GRASP.

Required Software

MAFFT - Multiple sequence Alignment (MA) based on the Fast Fourier Transform (FFT). See Katoh and Standley (2013).
RAxML - Randomized Accelerated (RAx) Maximum Likelihood (ML) to infer evolutionary relationships. See Stamatakis (2014).
Alternative to RAxML: FastTree. See Price, Dehal, and Arkin (2010).

We also recommend using an alignment viewer, such as AliView or Jalview, and a tree viewer, such as FigTree or Archaeopteryx.

Although the tutorial is written assuming you have installed MAFFT and RAxML, it is also possible to use other alignment and phylogenetic tree inference packages to complete the steps. The tutorial can also be completed by using online-only tools such as MAFFT online for alignment and IQ-TREE for tree inference.

Pre-requisites

Install required software

MAFFT (https://mafft.cbrc.jp/alignment/software/) and RAxML (https://sco.h-its.org/exelixis/web/software/raxml/) are command-line programs for aligning sequences and inferring a phylogenetic tree. Instead of RAxML you may also wish to install FastTree (http://www.microbesonline.org/fasttree/). If you do not have these already, install the software on your machine using the instructions found on the websites for your operating system.

MAFFT for Mac/Linux/Windows

Download the executable file for your operating system from the MAFFT website and install using the prompts.

RAxML for Mac/Linux

1. Download the source files from GitHub (https://github.com/stamatak/standard-RAxML), either as an archive or by cloning the repository. To clone, run the following command from your home directory in a terminal (command line) window:
```
git clone https://github.com/stamatak/standard-RAxML.git 
```
2. Navigate into the standard-RAxML directory using
```
cd standard-RAxML 
```
3. Make the files by running
```
make -f Makefile.PTHREADS.gcc
```
4. Remove the intermediate object files that the build created by running
```
rm *.o
```

RAxML for Windows

A graphical user interface, raxmlGUI, is available for download. See Silvestro and Michalak (2012).

FastTree

Detailed install instructions for all operating systems can be found at http://www.microbesonline.org/fasttree/. See Price, Dehal, and Arkin (2010).

Curate an alignment and infer a tree

Download the extant sequence library

The sequences we will be reconstructing are from the cytochrome P450 2U1 (CYP2U1) and cytochrome P450 2R1 (CYP2R1) subfamilies. Cytochromes P450 are key drug metabolisers, and we wish to infer a CYP2U1 ancestor. We will use four CYP2R1 sequences as an outgroup; these four sequences are labelled 25hydroxylase or 25hydroxylase-like in the data set. The data set is the tutorial FASTA file listed under Data Downloads.

The data set was compiled by searching NCBI for homologs of representative CYP2U1s. For the purpose of illustration, we are not using the full set.

1. Align the extant sequences

Align the sequences using the default MAFFT settings. In effect, this step identifies the positions at which sequences are (putatively) homologous.
In a terminal window, change to the directory where you placed the downloaded tutorial .fasta file and type the following command:

mafft GRASPTutorial.fasta > GRASPTutorial.aln

This will perform the multiple sequence alignment and save it to "GRASPTutorial.aln" in the same directory. By default, MAFFT writes the alignment in FASTA format, using "-" as the gap character.
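Before moving on, it can help to confirm that the alignment file looks sane. The following is a minimal sketch, not part of MAFFT or GRASP: `aln_summary` is a hypothetical helper name, and it assumes the alignment is in FASTA format as described above. It counts the sequences and reports how many distinct aligned lengths occur; a valid alignment has exactly one.

```shell
# Summarise a FASTA-format alignment: number of sequences and number of
# distinct aligned lengths (gaps included). Hypothetical helper, for illustration.
aln_summary() {
    awk '
        /^>/ { if (n) lens[len]++; n++; len = 0; next }   # header: close previous record
             { gsub(/[ \t\r]/, ""); len += length($0) }   # sequence lines may be wrapped
        END  {
            if (n) lens[len]++                            # close the last record
            k = 0
            for (l in lens) k++
            printf "sequences=%d distinct_lengths=%d\n", n, k
        }
    ' "$1"
}
```

For example, `aln_summary GRASPTutorial.aln` should report `distinct_lengths=1`; any other value suggests the file was truncated or edited incorrectly.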

2. Infer a tree using RAxML

We will use RAxML to infer a phylogenetic tree by maximum likelihood.

If you are using the user interface, use 'GRASPTutorial.aln' as the input alignment and follow the prompts in the interface. Save the resulting tree, and move to Step 3.

If you are using the command-line version (recommended), type the following command in a terminal from within the directory where you installed RAxML.
Note: you will need to copy the input file (GRASPTutorial.aln) into the RAxML directory first.

./raxmlHPC-PTHREADS -m PROTGAMMAJTT -p 23456 -n GRASPTutorial.nwk -s GRASPTutorial.aln

-m PROTGAMMAJTT

PROTGAMMAJTT tells RAxML to optimise substitution rates, use a GAMMA model of rate heterogeneity, and use the JTT amino acid substitution matrix.

-p 23456

Random number seed for inference.

-n GRASPTutorial.nwk

Name of the run, which RAxML appends to the names of its output files (e.g. RAxML_result.GRASPTutorial.nwk).

-s GRASPTutorial.aln

Name of the input alignment file.

When finished (approximately 1 minute, depending on your machine) you will see some summary information output to the console. This will also tell you where the output file has been saved.

Or 2. Infer a tree using FastTree

We will use FastTree to infer an approximately-maximum-likelihood phylogenetic tree.

Type the following command in a terminal:

FastTree GRASPTutorial.aln > GRASPTutorial.nwk

This will compute a phylogenetic tree based on GRASPTutorial.aln and save it as a Newick file in GRASPTutorial.nwk.

3. Inspect the alignment and phylogenetic tree

Use an alignment viewer (such as AliView or Jalview) and a tree viewer (such as FigTree) to inspect the resulting alignment and inferred tree (e.g. .../RAxML_result.GRASPTutorial.nwk).

Do any sequences look out of place in the alignment and the tree? Note down the sequence/s.
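Visual inspection remains the main check, but a rough numeric summary can point you to sequences worth a closer look. The sketch below is an illustration only: `gap_report` is a hypothetical helper, and it assumes a FASTA-format alignment in which `-` is the gap character. For each sequence it prints the ID (first word of the header), the number of gaps, the aligned length and the gap fraction, with the most gap-rich sequences first.

```shell
# Print "id <TAB> gaps <TAB> aligned_length <TAB> gap_fraction" for every
# sequence in a FASTA alignment, sorted by gap fraction (highest first).
gap_report() {
    awk '
        function report() {
            if (id != "")
                printf "%s\t%d\t%d\t%.2f\n", id, gaps, len, (len ? gaps / len : 0)
        }
        /^>/ { report(); id = substr($1, 2); gaps = 0; len = 0; next }
             { gsub(/[ \t\r]/, ""); len += length($0); gaps += gsub(/-/, "-") }
        END  { report() }
    ' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k4,4nr
}
```

A high gap fraction is not proof of an error; it only flags sequences to examine in the alignment viewer, e.g. `gap_report GRASPTutorial.aln | head`.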

4. Perform quality control on input data

Remove the erroneous sequence/s from the input FASTA file.

Perform the alignment and tree inference steps with the modified file (Steps 1-2).

Inspect the new alignment and inferred tree (Step 3).

Do any sequences look out of place? You might not see any obvious erroneous sequences in the tree, but some sequences may appear to have local misalignments and have several large internal deletions, despite appearing to align well overall. Note the sequence/s down and repeat this step.
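Removing sequences by hand in a text editor works, but it is easy to leave part of a record behind. As one possible alternative, the sketch below filters a FASTA file against a list of IDs. The helper name `drop_seqs` and the file names in the usage note are hypothetical, and the sketch assumes that each ID in the list is the first word of a FASTA header (without the `>`).

```shell
# Usage: drop_seqs INPUT.fasta IDS.txt > OUTPUT.fasta
# IDS.txt holds one sequence ID per line; matching records are removed.
drop_seqs() {
    awk '
        FILENAME == ARGV[1] {                      # first file: IDs to remove
            sub(/\r$/, "")
            if ($1 != "") drop[$1] = 1
            next
        }
        /^>/ { keep = !(substr($1, 2) in drop) }   # decide once per record
        keep                                       # print header and sequence lines of kept records
    ' "$2" "$1"
}
```

For example, after noting the IDs in a file called `remove.txt`, run `drop_seqs GRASPTutorial.fasta remove.txt > GRASPTutorial_filtered.fasta` and repeat Steps 1-2 with the filtered file.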

5. Export tree for inference

Once you are happy with the alignment, using FigTree, re-root the tree on the branch between the CYP2U1 and the CYP2R1 sequences.

Note: FigTree may strip the longer name in the sequence labels and so you may need to refer back to your alignment to identify the sequences that belong to 2U1 and those that belong to 2R1. Once re-rooted, you should have a tree with two distinct groupings of approximately 4 and 20 sequences.
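If the labels shown in FigTree are too short to tell the two subfamilies apart, you can list which IDs belong to the outgroup directly from the FASTA headers. This is a minimal sketch under the assumption stated earlier in the tutorial, namely that outgroup headers contain the text 25hydroxylase; `label_groups` is a hypothetical helper name.

```shell
# Print "outgroup <TAB> id" or "ingroup <TAB> id" for every FASTA record,
# where id is the first word of the header. Matching is case-insensitive.
label_groups() {
    awk '
        /^>/ {
            id = substr($1, 2)
            group = (tolower($0) ~ /25hydroxylase/) ? "outgroup" : "ingroup"
            printf "%s\t%s\n", group, id
        }
    ' "$1"
}
```

Running `label_groups GRASPTutorial.fasta | cut -f1 | sort | uniq -c` gives the size of each group, which you can compare with the two groupings in the re-rooted tree.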

Export the tree in FigTree as a Newick file, say "GRASPTutorial_FigTree.nwk": File > Export Trees... select Newick from the menu and select the 'Save as currently displayed' option.

Note: Since FigTree surrounds the labels with quotation marks, the labels will not match those in the alignment file and will cause an error when attempting to run the reconstruction.

You can open the Newick file in a text editor to remove all single quotation marks from the labels, or more simply use a command such as "tr" to delete all occurrences of "'":

tr -d \' < GRASPTutorial_FigTree.nwk > GRASPTutorial.nwk
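To confirm that the tree and alignment labels now agree, you can compare the two sets of labels before uploading. The following is a rough sketch rather than a full Newick parser: the helper names are hypothetical, it assumes labels contain no parentheses, commas, colons or semicolons, and it assumes the tree uses the first word of each FASTA header as its label.

```shell
# Leaf labels of a Newick tree, one per line, sorted.
tree_labels() {
    tr -d '\n\r' < "$1" |
        sed -e 's/)[^,();]*/)/g' |                  # drop internal-node labels and lengths
        tr '(),;' '\n\n\n\n' |
        sed -e 's/:.*//' -e '/^[[:space:]]*$/d' |   # drop leaf branch lengths and blanks
        LC_ALL=C sort
}

# Sequence IDs (first word of each header) of a FASTA file, sorted.
aln_labels() {
    grep '^>' "$1" | sed -e 's/^>//' -e 's/[[:space:]].*//' | LC_ALL=C sort
}

# Usage: label_mismatches TREE.nwk ALIGNMENT.aln
# Prints labels found only in the tree (left column) or only in the
# alignment (indented column); prints nothing if the two sets match.
label_mismatches() {
    tree_tmp=$(mktemp) && aln_tmp=$(mktemp) || return 1
    tree_labels "$1" > "$tree_tmp"
    aln_labels "$2" > "$aln_tmp"
    LC_ALL=C comm -3 "$tree_tmp" "$aln_tmp"
    rm -f "$tree_tmp" "$aln_tmp"
}
```

For example, `label_mismatches GRASPTutorial.nwk GRASPTutorial.aln` should print nothing once the quotation marks have been removed; a label that still carries quotes shows up as a mismatch.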

Alternatively, you can use Archaeopteryx to view and re-root the tree by right-clicking on the desired node and selecting Root/Reroot.

Infer ancestral sequences

If you have had issues curating the alignment for the example data set, or want to jump ahead to exploring GRASP, download the curated alignment and tree listed under Data Downloads.

6. Perform the reconstruction using GRASP

Run the final alignment and tree through GRASP.

GRASP may indicate that some of the sequences are obsolete.

Note down any obsolete sequences and optionally remove these from the original extant sequence file. Perform the alignment and tree inference steps with the modified file (Steps 1-2). Repeat Step 6 with the new alignment and tree files.

7. Inspect the reconstructed sequences

Explore GRASP:

Click Annotate Taxonomy at the top of the tree

Note: For this action you will need to register an account, log in and save your reconstruction. GRASP will then retrieve the taxonomic information for the sequences. Note: taxonomic information will only be displayed if the input sequences are labelled with NCBI or UniProt identifiers (as in this tutorial).

Left-click on tree nodes

This will show any taxonomic information (if available) of the selected ancestor. At ancestral nodes, this displays a summary of the taxonomic information for all child extant sequences. Common ranks are listed, and a histogram is displayed for differing taxonomic information. Extant sequences (or leaf nodes in the tree) show the full available taxonomic information for that extant.

Right-click on tree nodes

This will show a menu listing options for displaying the tree and performing further reconstructions. From this menu, we can collapse and expand the tree nodes, perform a joint or marginal reconstruction, or add a joint reconstruction graph that will be displayed below the current joint reconstruction. See the guide for more details.

Inspect the reconstructed ancestral POG

At the bottom of the page you will see at least two partial order graphs. The top graph is the alignment graph (POAG; also referred to as MSA as it derives directly from the input multiple sequence alignment), and the bottom graph/s are the reconstructed ancestor graphs (POGs). Hovering over the graph nodes will display a popup showing a histogram of the characters in the alignment POAG and in a marginal reconstruction POG. More information about these nodes can be found in the guide.

We can navigate across the reconstructed graph/s by sliding the purple rectangle along the navigation line above the (MSA) POAG.

Explore the other display options on the page.

8. Explore insertions and deletions

Identify the node that represents the ancestor of only the CYP2U1 sequences and infer the ancestor here.

Note: in Step 5 we re-rooted the tree on the branch between the CYP2U1 and CYP2R1 sequences, i.e. the root node is the ancestor of both the CYP2U1 and CYP2R1 sequences.

What do the red circles in the navigation bar indicate?

Note: you may need to visually compare the ancestral POG to the (MSA) POAG and inspect the regions where there are red circles on the navigation line to help answer this question. It may also help to look at the ancestor of just the CYP2R1 sequences.

Which sequences contribute to the edge in the (MSA) POAG that jumps the node that is missing from the CYP2U1 ancestor? Does this explain why this node isn't inferred in this ancestor?

Take note of the node ID (i.e. the grey number under the node) of the first red circle.

Infer the CYP2R1 ancestor and navigate to the node ID you took note of.

Insertion or deletion events are also indicated by grey boxes in the navigation bar. When we look at the ancestral POG, we will see multiple paths between nodes. This means that more than one path through these characters has been inferred to be parsimonious. The darker paths have greater support than the lighter paths; however, both could be considered.

9. Investigate differences between the CYP2U1 ancestor and the CYP2U1 fish ancestor

Now let's use GRASP to explore differences between a set of ancestors. Identify just the CYP2U1 sequences that have come from ray-finned fish.

We want to identify differences between the CYP2U1 ancestor, the CYP2U1 fish ancestor, and the CYP2U1 ancestor of the non-fish sequences.

Set up your data so you are displaying the (MSA) POAG, the CYP2U1 ancestor, the CYP2U1 fish ancestor and CYP2U1 non-fish ancestor simultaneously.

Note: investigate the difference between using the 'Add joint reconstruction' and 'View joint reconstruction' commands to achieve this.

Can you identify differences between these inferred ancestors?

Look at the first grey box in the navigation bar and note the differences between the ancestors.

Which sequences are contributing to the alternative pathway found in this first grey box? Does this explain the differences between the fish and non-fish CYP2U1 ancestors at this spot?

Look at the second grey box in the navigation bar and note the differences between the ancestors.

Remember that GRASP indicates a more supported path (based on edge parsimony) as a darker edge.

Also, now that we are adding POGs to view in tandem, it is important to realise that the red circles that indicate deletions are shown only in relation to the last added ancestral POG, while grey boxes summarise the alternative pathways across all added ancestral POGs.

Why might it be useful to consider both paths when reconstructing an ancestor?

10. Download the preferred path sequence of the CYP2U1 ancestor or Save your session

Reconstruct the CYP2U1 ancestor.

Press the 'Download Results...' button underneath the reconstructed graph, select 'Preferred path sequence of...' and press the 'Download' button. Preferred path sequences only follow a single path, so you can use your alignment viewer to look at the downloaded results.

Note: there are a few things that can be downloaded from your reconstruction, mostly more complex representations of the reconstruction (refer to the GRASP Guide). Alternatively, you can save your reconstruction for later by pressing the blue disk at the top left corner and following the prompts for creating an account with GRASP. You can access the list of previously saved reconstructions by clicking your account name which will appear in the menu at the top right corner.

11. Inspect the probabilities of amino acids at an ancestor

So far we have only looked at the so-called joint reconstructions of ancestors. We will now perform a marginal reconstruction of the CYP2U1 ancestor. Refer to the GRASP Guide for more technical information on the difference between the two types of inference. Marginal reconstruction gives us a better understanding of the biological variability that exists at a specified ancestor; we are essentially asking 'what is the probability of each amino acid, in each position'.

Select 'View marginal reconstruction' for the CYP2U1 ancestor. You will note that the topology of the POG is identical to that of performing joint reconstruction. (You can use the navigation bar to look at the same two red circles, for instance.) The difference lies in what information is inferred for nodes with character states. Inspect that information by hovering over or clicking the nodes in the inferred POG.

Can you identify two nodes where the probability of one amino acid is about 50% and that of a second amino acid is almost as high? Make a note of the positions they occupy (i.e. the node IDs).
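If you prefer to check your answer numerically, one option is to transcribe the distributions you see in the popups into a small table and filter it. The sketch below is purely illustrative: the three-column, tab-separated layout (node ID, amino acid, probability) is an assumed format for hand-transcribed values, not a file that GRASP is known to produce, and both the helper name `ambiguous_nodes` and the default thresholds (top probability at most 0.6, gap to the runner-up at most 0.15) are arbitrary choices.

```shell
# Usage: ambiguous_nodes TABLE.tsv [MAX_TOP] [MAX_DIFF]
# TABLE.tsv (assumed layout): node_id <TAB> amino_acid <TAB> probability
# Prints nodes whose two most probable amino acids are close to each other.
ambiguous_nodes() {
    awk -F '\t' -v max_top="${2:-0.6}" -v max_diff="${3:-0.15}" '
        {
            p = $3 + 0
            if (!($1 in best)) { best[$1] = 0; second[$1] = 0 }
            if (p > best[$1]) {                    # new top: old top becomes runner-up
                second[$1] = best[$1]; second_aa[$1] = best_aa[$1]
                best[$1] = p;          best_aa[$1] = $2
            } else if (p > second[$1]) {           # new runner-up
                second[$1] = p; second_aa[$1] = $2
            }
        }
        END {
            for (id in best)
                if (best[id] <= max_top && best[id] - second[id] <= max_diff)
                    printf "%s\t%s=%.2f\t%s=%.2f\n", id, best_aa[id], best[id], second_aa[id], second[id]
        }
    ' "$1" | LC_ALL=C sort -n
}
```

Each output line lists a node ID followed by its two most probable amino acids, which you can compare with the nodes you identified by eye.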

Such positions are good candidates to mutate to explore ancestral variants. A new menu item 'View mutants' has appeared above the navigation bar. Click on it and increase the number to '1'. For each increment a small triangle appears somewhere along the navigation bar; it indicates a position in the POG that is a good candidate for mutation. Using the navigation bar, inspect and make a note of the distribution at the node identified by the first triangle to appear. Then, keep incrementing the number until a triangle identifies the position of one of your own 50/50 nodes.

If you were to allow a mutation to happen, which of the positions above do you think would best explore the space of possible CYP2U1 ancestors?

See the GRASP Guide to work out why the program chose its position.

Bring your own sequences

Repeat Steps 1-11 using your own sequence library.

A major issue with the example above is the limited number of sequences that we have included. Using the public databases as of early 2018, the set of high-quality homologs of CYP2U1 is close to 200. Excluding sequences is likely to impact the quality of your ancestral predictions.

To construct your own sequence set for a protein family of interest to you, we recommend that you make use of resources such as UniProt and NCBI. If you have specific (seed) protein sequences whose ancestors you would like to explore, we suggest you use tools such as BLAST to create an initial sequence library. It is imperative that you exercise caution, as exemplified above, before you interpret ancestral predictions. There is currently no tool that will curate your data as well as a well-informed protein scientist...

References

Katoh, K. and Standley, D. M. (2013) MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol Biol Evol. 30(4): 772-780.
Stamatakis, A. (2014) RAxML Version 8: A tool for Phylogenetic Analysis and Post-Analysis of Large Phylogenies. Bioinformatics. 30(9): 1312-1313.
Price, M. N., Dehal, P. S. and Arkin, A. P. (2010) FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLOS ONE. 5: e9490.
Han, M. V. and Zmasek, C. M. (2009) phyloXML: XML for evolutionary biology and comparative genomics. BMC Bioinformatics. 10: 356.
Larsson, A. (2014) AliView: a fast and lightweight alignment viewer and editor for large data sets. Bioinformatics. 30(22): 3276-3278.
Silvestro, D. and Michalak, I. (2012) raxmlGUI: a graphical front-end for RAxML. Organisms Diversity and Evolution. 12(4): 335-337.

GRASP

Graphical representation of ancestral sequence predictions