GRASP has a web site for code, associated tools, data sets, and support information - see GRASP-suite. This site is updated to reflect changes to the service and resources around it.
You can browse and submit issues for the web service at GRASP/issues and for the command-line tool at bnkit/issues.
You can see updates to this website at the GRASP feature log.
If you use GRASP in your work please cite the paper listed below.
GRASP infers ancestral proteins from homologous protein sequences--a process known as ancestral sequence reconstruction. (It does not work with DNA at the moment, though we expect to introduce this in the future.) GRASP efficiently determines ancestral character states, and the most supported insertions and deletions. GRASP presents all this as partial-order graphs using a visual interface, connected to a phylogenetic tree, with which the user can interact.
Many features are only available once an ancestor reconstruction has been saved. To save a reconstruction, you need to be signed in as a user. We strongly recommend that you sign in, as the workflow will then be uninterrupted. Use the 'Login' menu to start the registration process. Note that the web service will log you out automatically if you are idle for an extended period of time. In that case, log in again, and your saved reconstructions will be available from the menu item marked with your user name.
Passwords are stored in an encrypted form, but we strongly advise you to use a unique password, as the HTTP protocol is subject to some security risks.
GRASP requires only two input files and one parameter to run successfully. On GRASP's landing page, you provide the two files: a set of aligned sequences and the corresponding phylogenetic tree.
GRASP accepts the following file formats.
Alignment | Phylogenetic tree |
---|---|
FASTA, Clustal | Newick |
To try GRASP out, on the landing page you are presented with the option of using example data (see (1) in Figure 1). This will automatically load an alignment file and tree file; you can then run the reconstruction and explore the results provided by GRASP. We will use results from running the test data labelled 'Afriat-Jurnou et al. (29)' and 'CYP2U1 (595)' for the figures below. Data sets are taken from publications listed at the bottom of this page.
If, at any stage, you require 'on-screen tips' about what you are seeing presented on the GRASP home or results page, you can hit the help button to switch on help texts (marked (?) in Figure 1).
To allow you to keep track of your results, on the GRASP landing page you must provide a unique name for your job (see (2) in Figure 1). This name will be used as an identifier in any saved output data.
Once you have a name, you can 'Select sequence alignment file...' (see (3) in Figure 1). Accepted alignment formats are noted in the table above.
You may want to use any of the multiple sequence alignment tools/services in the non-exhaustive list below to create your alignment file.
A phylogenetic tree must be provided using 'Select phylogenetic tree file...' (see (4) in Figure 1). Accepted tree formats are noted above.
Some of the alignment tools listed above can provide phylogenetic trees. Alternatively, some tools/services/packages for creating trees from your alignment are listed below.
GRASP requires your alignment to have the same identifier names as your phylogenetic tree. Some programs will change the identifier names of your alignment in order to remove certain characters.
For example, using MAFFT to generate trees will change certain characters into underscores, and RAxML will require you to remove certain characters from identifier names before it can be used.
SeqScrub, developed by the GRASP team, provides an online interface to automatically change characters in sequence or alignment files, to keep a record of these changes, and to update corresponding phylogenetic trees at the same time.
SeqScrub can also add taxonomic information, and check that your sequences have not been made obsolete by the NCBI or UniProt databases.
If there are inconsistencies in your identifier names, GRASP will provide an error message listing the identifier names in your alignment and phylogenetic tree that do not match.
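Before uploading, it can help to compare identifiers yourself. The following is a minimal sketch, not part of GRASP: the function names are hypothetical, the FASTA identifier is assumed to be the first whitespace-delimited token of each header, and the Newick string is assumed to be simple (no quoted labels or comments).

```python
import re

def fasta_ids(text):
    """Return the identifier (first whitespace-delimited token) of each FASTA record."""
    ids = []
    for line in text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            ids.append(tokens[0] if tokens else "")
    return ids

def newick_leaf_labels(newick):
    """Return leaf labels of a simple Newick string.

    A leaf label directly follows "(" or "," and runs up to the next
    ":", ",", ")", ";" or whitespace. Labels of internal nodes follow ")"
    and are therefore not picked up.
    """
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)

def compare_ids(fasta_text, newick):
    """Return (only_in_alignment, only_in_tree) as sorted lists."""
    aln = set(fasta_ids(fasta_text))
    tree = set(newick_leaf_labels(newick))
    return sorted(aln - tree), sorted(tree - aln)

# Example: a tool has rewritten "C|x" as "C_x" in the tree only.
alignment = ">A some description\nMK-L\n>B\nMKAL\n>C|x\nMKAL\n"
tree = "((A:0.1,B:0.2):0.3,C_x:0.4);"
print(compare_ids(alignment, tree))
```

Both returned lists being empty suggests that the names agree; anything else points at the identifiers that need editing in one of the two files.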
GRASP can perform reconstruction based on a variety of evolutionary models that can be selected from the drop down menu 'Select other evolutionary model...' (see (5) in Figure 1). By default, JTT (Jones-Taylor-Thornton) is used, which is commonly used for the analysis of amino acid sequences.
GRASP can perform both joint and marginal reconstructions. GRASP will initially run a joint reconstruction; we have made marginal reconstruction available to be run on specific nodes in the results section of GRASP.
Once test data has been selected or user data provided, the reconstruction can be started (see (7) in Figure 1).
We strongly recommend that you save your reconstruction before running it (see (6) in Figure 1). This means that you have to be signed in as a user (see Login in the menu top-right of Figure 1), and that immediately after starting the reconstruction you will be forwarded directly to your collection of saved reconstructions. You will need to wait for an email that notifies you whether the reconstruction was successful. If it was, you will be able to refresh your list and select the reconstruction from there. If you received an error, please see below for possible causes and remedies.
If you opted not to save your reconstruction, the screen will be blocked and you will need to wait until the reconstruction finishes. Any errors will be displayed. If the reconstruction is successful, you will be forwarded to the results page (see below). You can now save the reconstruction (click the floppy disk top left), which will prompt you to log in, after which saving starts. Many functions will not be available until you do.
Any saved reconstruction can be re-loaded without completely re-running it. You can also share your reconstruction with other users, as long as you have their user names.
The following is a list of error messages that can be generated by GRASP and suggestions as to how to proceed if you encounter them.
Error message | Suggested actions |
---|---|
|
Your input files are in the incorrect format. Check that they are valid FASTA or Clustal formats with the correct file extensions. Check that your Clustal file has a correct header (starting with the word “CLUSTAL”). If they are in a different format there are tools available online to convert them to FASTA or Clustal (https://www.ebi.ac.uk/Tools/sfc/emboss_seqret/). |
|
An alignment file by definition should contain sequences with the same length. Check that your alignment file contains aligned sequences and that they are all the same length. |
|
Check your Newick file, specifically it is worth checking that you have a single Newick string that contains the correct number of opening and closing brackets, and that you do not have duplicate identifiers. |
|
GRASP splits Newick files on the “:” symbol and expects a distance to come immediately after each “:” symbol in the file. Check that the indicated value only contains numbers. If your identifiers themselves contain the “:” symbol they should be replaced (taking care to replace them in the corresponding sequence alignment). |
|
Either your files contain sequence names that are duplicated or sequence names do not match the names of nodes in the phylogenetic tree. You will be notified of names that are unique for each file. GRASP will not guess what sequence that should be placed in the tree, so make sure the names agree in the two files. Some alignment programs will amend the names, which means you may need to edit them manually. |
|
Check your sequence alignment and phylogenetic tree files and replace or remove the invalid symbols. Where ‘X’ has been used in place of an unknown amino acid, either replace the ‘X’ symbols with a ‘-’ symbol, or identify higher quality sequences without unknown amino acids, then realign them and infer a new phylogenetic tree. |
|
Download and install the local, command-line tool, which you can run on your own hardware. It does not have the graphical user interface, however. See GRASP-suite. |
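Several of the checks suggested in the table can be run locally before uploading. The sketch below is an illustration only, not GRASP's own validation: the function names are hypothetical, and the allowed alphabet (the 20 standard amino acids plus the gap symbol) is an assumption.

```python
import re

def check_alignment(records, allowed="ACDEFGHIKLMNPQRSTVWY-"):
    """records: list of (name, aligned_sequence). Returns a list of problem descriptions."""
    problems = []
    lengths = {len(seq) for _, seq in records}
    if len(lengths) > 1:
        problems.append(f"sequences differ in length: {sorted(lengths)}")
    names = [name for name, _ in records]
    dupes = sorted({n for n in names if names.count(n) > 1})
    if dupes:
        problems.append(f"duplicate names: {dupes}")
    for name, seq in records:
        bad = sorted(set(seq.upper()) - set(allowed))
        if bad:
            problems.append(f"{name}: invalid symbols {bad}")
    return problems

def check_newick(newick):
    """Check bracket balance and that a number follows every ':' symbol."""
    problems = []
    opening, closing = newick.count("("), newick.count(")")
    if opening != closing:
        problems.append(f"unbalanced brackets: {opening} opening, {closing} closing")
    # The text between a ":" and the next delimiter must parse as a number.
    for chunk in newick.split(":")[1:]:
        value = re.split(r"[,();]", chunk, maxsplit=1)[0].strip()
        try:
            float(value)
        except ValueError:
            problems.append(f"invalid distance after ':': {value!r}")
    return problems

print(check_alignment([("a", "MK-L"), ("b", "MKXL"), ("a", "MKL")]))
print(check_newick("((A:0.1,B:x2):0.3,C:0.4);"))
```

An empty list from both functions means that none of these particular problems were found; it does not guarantee that GRASP will accept the files.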
You should also be aware that the server only keeps the results for a limited time. You may notice that if you leave the results page open for a long time without interaction, it may stop responding.
The results page shows an interactive version of the phylogenetic tree that was provided for reconstruction. By default, labels of extant sequences are displayed and the user is given the option to also display or hide labels and branch lengths (see (2) in Figure 2). It is possible to search for nodes given a search pattern (see (3) in Figure 2). If extant labels follow a standard format, it is possible to download taxonomic information from NCBI and UniProt at the press of a button (see (2) in Figure 2).
GRASP will number ancestors (i.e. phylogenetic branch points) internally from the root (N0, N1, ...) using the principle of 'depth first'. N0 is always the root of the tree. You can see an ancestor label by hovering over the node, or in a pop-up box by clicking it. If available, taxonomic information will be collated, and summaries for ancestor nodes are shown when they are clicked (more information below).
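To illustrate what depth-first numbering means, here is a minimal sketch that labels the branch points of a small tree. It assumes a pre-order traversal (a parent is numbered before its children, children visited left to right) and a hypothetical nested-tuple tree representation; the order in which GRASP visits children may differ, so the labels in your own results should be taken from the results page or the downloaded tree.

```python
def label_ancestors(tree):
    """Label internal nodes N0, N1, ... in depth-first (pre-order) order.

    `tree` is a nested tuple: a leaf is a string, an internal node is a
    tuple of children. Returns a dict mapping each ancestor label to the
    labels of its children.
    """
    labels = {}
    counter = 0

    def visit(node):
        nonlocal counter
        if isinstance(node, str):      # extant sequence: keep its own name
            return node
        name = f"N{counter}"           # number the parent before its children
        counter += 1
        labels[name] = [visit(child) for child in node]
        return name

    visit(tree)
    return labels

tree = (("A", "B"), ("C", ("D", "E")))
for ancestor, children in sorted(label_ancestors(tree).items()):
    print(ancestor, children)
```

In this example the root is N0, the whole left subtree is numbered (N1) before the right subtree is entered (N2, then N3).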
It is possible to upload annotations for extant labels (see (1) in Figure 2) that can subsequently be searched for (more information below).
It is possible to search for sequence motifs in the ancestor sequences (see (3) in Figure 2); the syntax is described below.
For big trees you may want to reduce the number of nodes shown. Use the buttons at (4) to adjust how many nodes are collapsed and the size of text labels.
Each node is coloured by evolutionary distance according to the legend on the left hand side (see (6) in Figure 2). One node will always appear pink and be noted in the header above the tree. This is the node that is having its reconstruction displayed in the graph below the phylogenetic tree and is known as the primary node. By default, N0 (the root node) is the primary node, will appear pink and be listed in the header.
By hovering over nodes in the phylogenetic tree, the name will be revealed. By right clicking an ancestral node in the tree you will bring up a number of operations which can be performed on the ancestor. Note that some operations may take time to complete. (Unless previously run for the specified ancestor, marginal reconstruction will usually take as long as the initial reconstruction.)
Option | Operation |
---|---|
View marginal reconstruction | Change visualisation to display the marginal reconstruction of this ancestor; make this ancestor the primary node in the phylogenetic tree. The ancestor in the phylogenetic tree will be highlighted in pink. |
View joint reconstruction | Change visualisation to display the joint reconstruction of this ancestor; make this ancestor the primary node in the phylogenetic tree. The ancestor in the phylogenetic tree will be highlighted in pink. |
Add joint reconstruction | Add a panel to the visualisation of another ancestor, based on the current joint reconstruction. Additional graphs are "stacked" and shown with a blue background (the hue is determined by the evolutionary distance to the root). |
Collapse subtree | Hide all descendants/children of this ancestor. Use this when your tree is large, and hard to read. |
Expand subtree | Expand the view of descendants/children of this ancestor. Note that this operation needs to be repeated for big subtrees (to avoid flooding the screen). |
GRASP can assign tabulated series of values to both extants and ancestors by their labels (see (1) in Figure 2). For example, a user can use this to assign functional annotation indexed by a column header, say 'Substrate', to a certain protein, say 'N123' or 'P12345'. This would be uploaded as tab-separated values (TSV), e.g.
ID Name Substrate Activity
N123 Fav_ancestor Luciferase High(100)
P12345 My-enzyme Luciferase Moderate(10)
P54321 No-enzyme None None(0)
Upload this content (paste or type into the input window), and use the same popup menu to enter which column you want to search in, say 'Substrate', and the term that needs to be matched for a node to be highlighted, say 'Luciferase'.
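If you want to preview which nodes a search would highlight, the same table can be queried locally. This is a sketch only: the function name is hypothetical, and the matching rule (case-insensitive substring) is an assumption rather than a description of how GRASP matches terms.

```python
import csv
import io

EXAMPLE = (
    "ID\tName\tSubstrate\tActivity\n"
    "N123\tFav_ancestor\tLuciferase\tHigh(100)\n"
    "P12345\tMy-enzyme\tLuciferase\tModerate(10)\n"
    "P54321\tNo-enzyme\tNone\tNone(0)\n"
)

def matching_ids(tsv_text, column, term):
    """Return the IDs of rows whose value in `column` contains `term`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row["ID"]
        for row in reader
        if term.lower() in (row.get(column) or "").lower()
    ]

print(matching_ids(EXAMPLE, "Substrate", "Luciferase"))
```

Make sure the columns really are separated by tab characters; columns separated by spaces will be read as a single column.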
GRASP can automatically retrieve taxonomic information from the NCBI and UniProt databases by clicking on the Annotate taxonomy button, and append this information to the extant nodes of the phylogenetic tree. (The first time this is done, information needs to be downloaded so please be patient.) Clicking on an ancestor node will display a summary and histograms for each taxonomic rank, with numbers of extants that fall under each label.
For example, in Figure 4, we used the example data set 'Foley et al. DHAD (1612)', clicked annotate taxonomy, and clicked on ancestor 'N1087', which has about 400 extants.
In order to annotate taxonomy from the NCBI and UniProt databases, GRASP expects the sequence IDs in your FASTA file to be in either the NCBI or UniProt ID format -
>PRD21445.1 Cyp2u1 [Nephila clavipes]
or
>tr|A0A2P6K539|A0A2P6K539_NEPCL Cyp2u1 OS=Nephila clavipes OX=6915 GN=NCL1_51699 PE=4 SV=1
By default, GRASP will expect the ID to be separated by a whitespace from any additional information.
However, it will accept IDs which are completely stripped of additional information -
>PRD21445.1
or
>tr|A0A2P6K539|A0A2P6K539_NEPCL
Or IDs that have all whitespace removed (in order to force annotations to appear on phylogenetic trees) if the additional information is separated from the ID by a pipe symbol (|) -
>PRD21445.1|Cyp2u1_Nephila_clavipes
or
>tr|A0A2P6K539|A0A2P6K539_NEPCL|Cyp2u1_OS=Nephila_clavipes
Note that this means that headers where additional information is not separated by either a whitespace or a pipe symbol will not be able to have their taxonomy annotated - i.e. the following will not work -
>PRD21445.1_Cyp2u1_Nephila_clavipes
or
>tr|A0A2P6K539|A0A2P6K539_NEPCL_Cyp2u1_OS=Nephila_clavipes
All of the existing constraints about sequence names matching between alignment and phylogenetic tree must still be adhered to.
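The header rules above can be summarised in a short sketch. This is our reading of the examples, not GRASP's parser: the function name is hypothetical, and we assume that a UniProt-style ID consists of the first three pipe-separated fields when the header starts with 'sp' or 'tr', and that any other ID is the first pipe-separated field.

```python
def extract_id(header):
    """Return the part of a FASTA header that would be treated as the sequence ID."""
    tokens = header.lstrip(">").split()
    if not tokens:
        return ""
    fields = tokens[0].split("|")           # additional info may follow a pipe
    if fields[0] in ("sp", "tr") and len(fields) >= 3:
        return "|".join(fields[:3])         # UniProt style: db|accession|entry name
    return fields[0]                        # NCBI style: accession.version

headers = [
    ">PRD21445.1 Cyp2u1 [Nephila clavipes]",
    ">tr|A0A2P6K539|A0A2P6K539_NEPCL|Cyp2u1_OS=Nephila_clavipes",
    ">PRD21445.1_Cyp2u1_Nephila_clavipes",  # no whitespace or pipe before the extra text
]
for h in headers:
    print(extract_id(h))
```

For the last header the extracted text still contains the appended description, so it does not correspond to a database ID; this is why such headers cannot have their taxonomy annotated.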
You are able to search for ancestors with specified sequence motifs in saved joint reconstructions; note that searching is done only in preferred sequences, and not in optional paths (see more below).
%A%: Highlight sequences with an A in any position. You can also use just an A, but the % signs are recommended.
%A-*M%: Highlight sequences with an A followed by any number of gaps and then an M.
%AM(P|R)%: Highlight sequences with an A followed by an M, followed by either a P or an R.
%A_M%: Highlight sequences with an A followed by any other character (e.g. a gap, K, P etc.) followed by an M.
_: Wildcard character.
|: Denotes one of the two options.
*: Repetition of the preceding amino acid 0 or more times.
{n}: Repetition of the preceding amino acid n times.
{m, n}: Repetition of the preceding amino acid more than m times and less than n times.
Note 1: Since the gap patterns from the original alignment are preserved, unless you specify gaps or use (-*) between your search elements, only those sequences without gaps between the search terms will be highlighted.
Note 2: The search looks for a match anywhere in the sequence. This means that even though two sequences are highlighted, they might not have the motif in the same position. We therefore recommend that you confirm your findings by downloading the joint reconstructions and checking the alignment.
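To experiment with motifs offline, the syntax can be approximated with Python regular expressions. The translation below is an assumption on our part, not GRASP's implementation: the enclosing % signs are dropped (a regular expression search already matches anywhere), _ becomes the regex wildcard, and the remaining symbols are passed through unchanged. Note that Python treats the bounds in {m,n} as inclusive, which may differ from the description above.

```python
import re

def motif_to_regex(motif):
    """Translate a motif such as '%A-*M%' into a compiled regular expression."""
    body = motif.strip("%")                 # searching already matches anywhere
    body = body.replace("_", ".")           # '_' is the single-character wildcard
    # Python does not accept spaces inside '{m, n}', so remove them.
    body = re.sub(r"\{\s*(\d+)\s*,\s*(\d+)\s*\}", r"{\1,\2}", body)
    return re.compile(body)

def find_motif(motif, sequence):
    """Return the 0-based start of the first match, or None if there is none."""
    match = motif_to_regex(motif).search(sequence)
    return None if match is None else match.start()

# Note 1: gaps are preserved, so 'AM' does not match across a gap ...
print(find_motif("%AM%", "KA-MP"))
# ... unless gaps are allowed for explicitly.
print(find_motif("%A-*M%", "KA-MP"))
# Note 2: both sequences match, but at different positions.
print(find_motif("%AM%", "AMKK"), find_motif("%AM%", "KKAM"))
```

Reporting the match position, as `find_motif` does, is one way to confirm whether highlighted sequences carry the motif in the same alignment columns.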
GRASP uses partial-order graphs to represent ancestors. They are suited to this task as (a) sequence homology amongst members in a protein family (and by extension their ancestors) may not always map unambiguously to a single, linear ordering, and (b) unresolved insertions and deletions can be identified as alternative paths through the graph.
We suggest that you read a partial-order graph from left-to-right, node-by-node, forming a putative biological sequence (N- to C-terminus) by following a single path. While GRASP attempts to resolve the history of insertions and deletions, at a single node, you may be presented with multiple in- and out-going edges, each of which joins potential ancestral sequences together.
A node represents the character state at a position in a sequence. The position is not defined in an absolute sense, rather in a relative sense when there is evidence of an order. For practical reasons, there is still a positional index to assist navigation in GRASP.
The node displays either a single character state or a distribution over possible character states. The states of ancestral nodes are inferred by maximum likelihood.
An edge represents a valid path to create a full-length sequence. Once homologous sections have been determined by alignment, the joint set of extant sequences expresses the space of possible paths that can be taken. By enumerating these paths at each node, GRASP infers the maximally parsimonious edges, i.e. the edge or edges that imply the fewest changes across the whole evolutionary history. We call this process bi-directional edge-parsimony (BDEP). The term "bi-directional" here refers to the separate treatment of the in- and out-going edges of a node; an edge can therefore be maximally parsimonious for one or both of the nodes it joins. You can find a more detailed explanation here.
GRASP identifies edges for which there is qualified support, and makes distinctions that are important to interpret sequences that you assemble through them:
Solid edges are fully supported by bi-directional edge-parsimony; these links are strong candidates for how an ancestor can be assembled. They are not exclusive; multiple paths connecting different parts of a full-length sequence may be present. Nor are they guaranteed to identify even a single full-length sequence.
A dotted edge is maximally parsimonious when only one of the nodes it joins is considered (which one is not shown). These edges are weaker candidates but are kept to allow full-length ancestors to be formulated in the absence of the stronger option above.
GRASP also uses the number of extant sequences on which the inference is based. This can be used to separate edges further:
The darker the edge, the more extant sequences (of the set in the subtree under the ancestor) travel through the same edge.
GRASP provides the option of displaying one preferred path through a sequence. If this option is turned on, thick/bold edges represent a single, preferred traversal of the graph, giving preference to (a) bi-directional edge-parsimony support (edge is drawn dark) and (b) greater number of extant sequences following (in the subtree under the ancestor).
After running reconstruction, you will be able to see a navigation bar (see (1) in Figure 5), a multiple sequence alignment graph (see (2) in Figure 5) and the primary ancestor graph (see (3) in Figure 5) underneath the phylogenetic tree.
GRASP allows you to 'add' ancestors to be visualised alongside the primary ancestor. Each added ancestor becomes the current ancestor graph.
The navigation bar has a blue box (see (4) in Figure 5) which can be expanded, or collapsed and moved across the window to view different sections of the graphs. The region of the graph that you are viewing is within the blue box. Relative to the primary ancestor that is being viewed (see Phylogenetic tree) you will see grey boxes and red circles across the navigation bar.
Grey boxes represent insertions and deletions in the ancestor that cannot be resolved definitively; they indicate the presence of two or more edges in the ancestor at the corresponding position.
Red circles represent definitive absence of content in the current ancestor, relative to the input sequence alignment graph.
If multiple ancestor POGs are shown together, grey boxes are determined from the union of ancestors, but red circles apply only to the current ancestor.
The multiple sequence alignment is represented here as a partial-order graph that contains paths to regenerate all sequences uploaded by the user. Each node in the graph is depicted as a pie chart representing the distribution of amino acids at that position across all extant sequences. The most probable amino acid is shown using its one-letter code. When you hover your mouse over each node, a bar chart will display, showing the distribution in more detail, as seen in Figure 6.
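The distribution behind each pie chart can be reproduced from the alignment itself. The sketch below is an illustration with a hypothetical function name; it assumes that gap symbols are left out of the distribution, since a sequence with a gap at that position does not pass through the node.

```python
from collections import Counter

def column_distribution(sequences, column):
    """Return the relative frequency of each amino acid in one alignment column."""
    residues = [seq[column] for seq in sequences if seq[column] != "-"]
    counts = Counter(residues)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}

aligned = ["MKA", "MRA", "M-A", "MKG"]
dist = column_distribution(aligned, 1)
print(dist)
print(max(dist, key=dist.get))   # the letter that would be shown on the node
```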
The edges in the multiple sequence alignment graph represent the insertions and deletions of all extant sequences.
This graph displays the partial-order graph of the primary ancestor. More technical information on joint reconstruction is provided below.
Edges in reconstructions are inferred by edge-parsimony. One or more ancestral sequences can be formed by following one path. We recommend that solid edges are given preference, but all presented edges have some support.
When the 'Add joint reconstruction' option is selected for a node in the phylogenetic tree, its reconstructed sequence will be added below the primary ancestor (boxed with a blue shade that indicates the evolutionary distance of the phylogenetic node it is taken from to the root). We refer to this as the 'stack' and you can keep adding joint reconstructions to it; this allows for comparisons between ancestral states at different phylogenetic branch points.
Graphs can be saved as PNG images using the 'Save graphs' button (see (5) in Figure 5). The amino acid colour scheme is, by default, set to Clustal.
A marginal reconstruction can be performed on any ancestor selected in the phylogenetic tree on the results page. A marginal reconstruction will look similar to Figure 8.
The reconstruction of the primary node will be composed of pie charts similar to the input sequence alignment graph. Each pie chart depicts the distribution of amino acids at each position in the reconstructed ancestor. More technical information about marginal reconstruction is provided here.
Once a marginal reconstruction has been performed, it is possible to identify targets for mutational analysis. This is based on the idea that the reconstruction represents just one possible ancestor, and that others can be explored around it by mutating positions that are inferred to be in an ambivalent state.
The user can select the level of complexity of the mutant library. The base case "0" means that all nodes map to the single, most probable amino acid; setting the level to "0" therefore makes all nodes single-state. By increasing the level, alternative amino acids are added to positions that are more ambivalent; at higher levels of complexity, some positions are assigned two or more mutational targets. An arrow appears close to the location of the identified nodes alongside the navigation bar. You will note that the ambivalent nodes are shown with more character states. More technical information is provided below.
You can download results generated by GRASP, through the download section. The download section is expanded by clicking the "Download Results" drop down button located below the graph visualisation area (see (1) in Figure 9).
Note that downloaded sequences are based on a single, preferred path through each ancestor's partial-order graph, which itself is more information-rich. We recommend that you still consult the partial-order graph to check the evidence for the ancestor sequence.
To download results, click the appropriate button.
The tree (2) (in Figure 9) is the tree the user provided but with internal/ancestral nodes labelled (N0, N1, etc.).
A marginal reconstruction (3) (in Figure 9) is available as either a TSV file (tabulated for the probability of each character, in each position) or a FASTA file (with the most probable character in each position). Note that only the single, primary ancestor is downloaded.
A joint reconstruction (4) (in Figure 9) is available as a FASTA file (with nominated ancestors, or the full set of ancestors).
GRASP employs exact probabilistic inference; it distinguishes between two types of reconstructions:
Joint - GRASP infers the most probable combination of all ancestral states. The assignment of all ancestral states (across the whole phylogenetic tree) optimises a single objective (the joint state with the greatest likelihood). This will tell you which are the most probable ancestors, given the extant sequences and their phylogenetic relationships. The inference is valid for the whole tree, and it is practical to compare ancestors with one another as they represent snapshots from the same evolutionary history.
Marginal - GRASP infers the probability distribution of each residue for a single, nominated ancestor. This inference will present you with the probability of amino acids at each position in the nominated ancestor, given extant sequences and their phylogenetic relationships. The inference "marginalises" all other ancestors, which means that (a) each ancestor needs to be inferred separately, and (b) resulting distributions are not directly comparable. Multiple marginal reconstructions (for different ancestors) are based on the same input data, but their mutual consistency is not enforced; they allow for different versions of evolutionary histories along the same branches.
The most probable character state in each character distribution resulting from a marginal reconstruction is commonly identical to that from a joint reconstruction; so, in practice, it may not matter much which inference type you use.
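In symbols, the two inference types can be summarised as follows. The notation is ours and is introduced only for illustration: D stands for the extant character states at one alignment position, T for the phylogenetic tree with its branch lengths, and x_1, ..., x_k for the states of the k ancestors.

```latex
% Joint: a single assignment to all ancestors at once
(\hat{x}_1, \dots, \hat{x}_k)
  = \arg\max_{x_1, \dots, x_k} P(x_1, \dots, x_k \mid D, T)

% Marginal: the distribution for one nominated ancestor i,
% obtained by summing over the states of all other ancestors
P(x_i = a \mid D, T)
  = \sum_{x_j,\; j \neq i} P(x_1, \dots, x_{i-1}, a, x_{i+1}, \dots, x_k \mid D, T)
```

The joint result is one maximising assignment shared by all ancestors, whereas the marginal result is a full distribution that has to be computed separately for each nominated ancestor.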
Once GRASP performs a joint reconstruction it will determine and keep all ancestral states in memory; this process can take some time to complete. "Adding" nodes on the results page will not require additional inference of ancestral states, so the delay you experience is minor. Marginal reconstruction needs to be performed for each node; inference is separate so there will be longer delays each time even after you have moved to the results page. (That said, we now have implemented caching of these states, so if you have "added" them before, the delay is minor.)
Edges represent optional paths through which sequences can be composed; ultimately, the collective of extant sequences when aligned determines the set of all possible paths; that set is the source from which we draw the path or paths for any ancestor. We are interested in identifying which edges are best supported for an ancestor, given the phylogenetic relationships among the extant sequences. For this we use the basic principle of maximum parsimony, where only the most parsimonious edges are kept for each individual ancestor.
Maximum parsimony is applied to edges of a node; for each node, we separately process edges labelled by the nodes preceding it, and edges labelled by the nodes following it. An edge can therefore be supported in one direction but not the other. Uni-directional support is sufficient for GRASP to display an edge and maintain any paths to the nodes concerned; bi-directional support should be seen as a much stronger endorsement though. We run maximum parsimony to identify all edges that achieve the best score (the same score can be shared by multiple edge selections). This means that in many cases multiple paths through an ancestor cannot be ruled out.
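To give a feel for how parsimony can leave several edges standing, the sketch below applies the classic Fitch algorithm to one direction of one node: each extant sequence is assigned the index of the node its path continues to, and the algorithm returns the set of equally parsimonious choices at the root together with the number of changes. This is a textbook simplification under stated assumptions (a strictly bifurcating tree given as nested tuples, result reported for the root only, hypothetical names) and not GRASP's BDEP implementation.

```python
def fitch(tree, leaf_state):
    """Bottom-up Fitch pass on a binary tree.

    `tree` is a nested tuple (a leaf is a string); `leaf_state` maps each
    leaf to its character state, here the index of the next node on that
    sequence's path. Returns (states, changes) for the root of `tree`.
    """
    if isinstance(tree, str):
        return {leaf_state[tree]}, 0
    (left_set, left_cost), (right_set, right_cost) = (
        fitch(child, leaf_state) for child in tree
    )
    common = left_set & right_set
    if common:                                  # children agree: no change needed
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

# From position 5, sequences A and B continue to node 6; C and D skip to node 9.
tree = (("A", "B"), ("C", "D"))
print(fitch(tree, {"A": 6, "B": 6, "C": 9, "D": 9}))
```

In this example both out-going edges (5 to 6 and 5 to 9) are equally parsimonious at the root, so neither path can be ruled out, which is the situation in which an ancestor is drawn with alternative edges.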
Not uncommonly, multiple indel histories are equally parsimonious, implying that several ancestor candidate sequences can be identified by traversing an ancestor POG; however, in some applications it is necessary to nominate a single sequence. To determine a 'preferred' path through an ancestor POG, we determine the proportion of extant sequences in the subtree under the ancestor that follow each edge. We use this proportion to express preference between multiple edges.
GRASP uses the A* algorithm to determine the selection of edges in a POG that jointly minimise the cost, travelling from the N- to the C-terminus. GRASP imposes an absolute preference for bi-directionally parsimonious edges; a uni-directional edge is only chosen in the absence of bi-directional edges to complete the traversal. The exception is the edge to the first node, and the edge from the last node, where bi-directionality is disregarded.
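The preference rules can be illustrated with a small path search. The sketch below is not GRASP's A* implementation: it uses dynamic programming over edges in positional order, and the cost function is an assumption (a lexicographic pair of the number of uni-directional edges used, then the summed proportion of extant sequences that do not follow each edge). Node indices are assumed to increase from the N- to the C-terminus, and all names are hypothetical.

```python
def preferred_path(edges, start, end):
    """Return the preferred list of node indices from `start` to `end`.

    `edges` holds tuples (src, dst, bidirectional, proportion) with src < dst.
    Bi-directionality is disregarded for edges leaving `start` or entering `end`.
    """
    best = {start: ((0, 0.0), [start])}     # node -> (cost, path so far)
    for src, dst, bidirectional, proportion in sorted(edges):
        if src not in best:                 # src cannot be reached from start
            continue
        (uni, loss), path = best[src]
        exempt = src == start or dst == end
        penalty = 0 if (bidirectional or exempt) else 1
        cost = (uni + penalty, loss + (1.0 - proportion))
        if dst not in best or cost < best[dst][0]:
            best[dst] = (cost, path + [dst])
    return best[end][1] if end in best else None

edges = [
    (0, 1, True, 1.0),
    (1, 2, True, 0.4),
    (1, 3, False, 0.6),   # followed by more extants, but only uni-directional
    (2, 3, True, 0.4),
    (3, 4, True, 1.0),
]
print(preferred_path(edges, 0, 4))
```

Because edges are processed in ascending order of their source index, every route into a node has been considered before the edges leaving it are examined. In the example, the route through node 2 wins even though fewer extant sequences follow it, since it avoids the uni-directional edge.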
A marginal reconstruction allows the user to identify the positions that should be mutated with priority, to explore the space of probable ancestors.
Initially, the complexity of the mutant library is "0", suggesting for each position, only the most probable amino acid. At "1", a single mutation is allowed, at the position that best improves the summed Kullback-Leibler divergence between the mutant distributions and the corresponding marginal distributions, across all sequence positions. For each increase in the level, another mutation is identified where an alternative amino acid is suggested. Note that additional mutations could occur at a position that has been mutated previously or in a different position. Either way, the potential library will increase in complexity, i.e. increase the number of theoretically possible mutants.
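The level-by-level growth of the library can be sketched as a greedy procedure. The details below are assumptions made for illustration and are not taken from GRASP's code: the "mutant distribution" at a position is taken to be the marginal distribution restricted to the selected amino acids and renormalised, in which case its Kullback-Leibler divergence to the marginal equals minus the logarithm of the probability mass covered. All names are hypothetical.

```python
import math

def kl_to_marginal(chosen, marginal):
    """KL divergence between the renormalised restriction to `chosen` and `marginal`."""
    return -math.log(sum(marginal[aa] for aa in chosen))

def mutant_library(marginals, level):
    """Return, per position, the set of amino acids suggested at `level`."""
    # Level 0: only the most probable amino acid at every position.
    library = [{max(dist, key=dist.get)} for dist in marginals]
    for _ in range(level):
        best = None                          # (gain, position, amino acid)
        for pos, dist in enumerate(marginals):
            for aa in dist:
                if aa in library[pos]:
                    continue
                gain = (kl_to_marginal(library[pos], dist)
                        - kl_to_marginal(library[pos] | {aa}, dist))
                if best is None or gain > best[0]:
                    best = (gain, pos, aa)
        if best is None:                     # nothing left to add
            break
        library[best[1]].add(best[2])
    return library

marginals = [
    {"A": 0.8, "G": 0.2},
    {"K": 0.5, "R": 0.4, "Q": 0.1},
    {"L": 1.0},
]
for level in range(3):
    lib = mutant_library(marginals, level)
    print(level, lib, math.prod(len(s) for s in lib))
```

The last number printed on each line is the count of theoretically possible mutants, i.e. the product of the number of options at each position, which grows as the level increases.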
Foley G, et al. Identifying and engineering ancient variants of enzymes using Graphical Representation of Ancestral Sequence Predictions (GRASP). bioRxiv 2019. DOI:10.1101/2019.12.30.891457
Version 2020.05.05
Supported by the Australian Research Council DP160100865 to Bodén, Gillam, Kobe, and Rost.
GRASP is part of the GRASP-suite of tools for ancestral sequence reconstruction
Contact:
Mikael Bodén m.boden@uq.edu.au.
Technical issues or feature requests:
Gabriel Foley gabriel.foley@uqconnect.edu.au.