Welcome

We present EPoS, a modular software framework for phylogenetic analysis and visualization. Existing phylogenetic software can be split into two groups: algorithmic packages that provide computational methods for a specific problem, and visualization tools to analyse the results. Many algorithmic tools lack usability, as they are often command line based, while visualization tools often suffer from poor graphical user interfaces. EPoS fills this gap by combining a powerful graphical user interface with a plugin system that allows simple integration of new algorithms, visualizations and data structures. A consistent interface is used to manage data and to start the available computational methods, decoupling the workflow from the underlying data and methods.

Currently, EPoS provides algorithms for tree reconstruction, tree distances, consensus trees and supertree methods, as well as BLAST support and cluster integration. All methods are integrated into a pipeline system that allows methods to be combined and executed sequentially, with the data flow handled automatically by the system. Some of the currently supported methods are:

Alignments

  • ClustalW
  • MAFFT
  • MUSCLE
  • T-Coffee/M-Coffee
  • DIALIGN-TX

Tree Construction

  • RAxML
  • MrBayes
  • PAUP Parsimony
  • NJ and Agglomerative Clustering

Trees

  • Consensus Trees
  • Maximum Agreement Subtrees
  • Supertree Construction

Besides the integration of the methods above, EPoS offers a Tree Comparator that can be used to visually compare trees, BLAST support, NCBI GenBank integration, taxonomy support, scripting support, and an easy way to integrate your compute cluster (Sun Grid Engine) to execute algorithms remotely.

Create Scripts with Code Completion in IntelliJ

I just found out there is an easy way to use IntelliJ IDEA to create and start Epos Scripts. This allows you to use full code completion and documentation in the code editor 🙂

To set things up, install Epos and IntelliJ IDEA and create a new Maven project. In the project's pom.xml, add the Epos repository and a single dependency on the full Epos application:


<repositories>
  <repository>
    <id>bioinf-jena</id>
    <url>http://bio.informatik.uni-jena.de/artifactory/repo</url>
    <name>Bioinf Jena</name>
  </repository>
</repositories>
<dependencies>
  <dependency>
    <groupId>de.unijena.bioinf</groupId>
    <artifactId>epos</artifactId>
    <version>0.9</version>
  </dependency>
</dependencies>

Now create a new Groovy script in src/main/java, say TestScript.groovy, and enter the following code:


import epos.model.tree.io.Newick

// parse a Newick string into an Epos tree
def tree = Newick.getTreeFromString("((A,B),C);")
// visit every node in depth-first order, starting at the root
tree.getRoot().depthFirstIterator().each { n ->
  println "Node : ${n}"
}

Notice that you should already have full code completion now. You can also open the “Maven Projects” tab on the right and click on “Download Sources and Documentation” to get access to the Epos source code and JavaDoc comments (the download might take a while). For example, move the cursor over the depthFirstIterator call and press Ctrl-Q (or CMD-J on a Mac) to get JavaDoc comments. You can also press Ctrl-B (or CMD-B) to jump to the source.

Intellij Javadoc comment

Last but not least, we have to set up an External Tool to start our script right from within IDEA. You need this if you want to access Epos instances from within the script, for example the workspace or the job handler. In the IDEA main menu, select File->Settings and search for External Tools. Create a new External Tool and enter the following:

  • An arbitrary Name and Description (e.g. Epos)
  • Program: the full path to the epos executable
  • Parameters: -s --db MEM --script $FilePath$
  • Working Directory: $FileDir$

This will start the given Epos installation and pass the path to the file currently opened in IntelliJ as the script parameter. The file's directory (i.e. src/main/java) is used as the working directory.
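In effect, with TestScript.groovy open in the editor, the External Tool runs something along these lines (the installation and project paths are illustrative only):

cd /path/to/project/src/main/java
/path/to/epos -s --db MEM --script /path/to/project/src/main/java/TestScript.groovy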

External Tool Configuration

Now, with the script open, you can use the main menu Tools->Epos to start the current script in Epos (note that the menu entry depends on the name you entered before).

Tutorial: Merging Alignments and RAxML Multi-Gene Analysis

In this tutorial we will create two alignments and perform a multi-gene analysis using RAxML. To get things started, let's fetch data from NCBI. I randomly picked five taxa from "A multigene phylogeny of the Dothideomycetes using four nuclear loci" and will use their SSU and LSU regions.

Start Epos and go to File->New->Fetch from GenBank. Add the following Accession numbers for SSU:

AY584667,DQ678000,DQ678012,DQ471039,DQ678043

and for LSU:

AY584643,DQ678053,DQ678064,DQ470987,DQ678097

(Note that you can copy-paste the whole line; Epos will split it at the commas.) Hit Add and then Fetch the sequences. You will end up with 10 new sequences in your workspace. Click on the Sequences button in the main toolbar and select the SSU sequences (Epos keeps the order, so the first five should be SSU while the last five are LSU, but you can also check the descriptions of the sequences; alternatively, select the sequences, right-click and select Assign->Gene… to assign SSU or LSU as gene names). With the SSU sequences selected, click the Run button in the toolbar or right-click and select Run. Select ClustalW (or any other alignment method). The ClustalW configuration appears. If ClustalW is not installed on your machine, click on Install Tool to specify the path to ClustalW or to auto-install it. Set the result name to "SSU" and keep the rest of the algorithm configuration at its defaults. Now hit the Run button in the lower right corner. A job will be started on your local machine.

Repeat the steps for LSU: select the LSU sequences, click on Run, select ClustalW, set the result name to LSU and start the algorithm. Now show the Jobs manager by clicking on Jobs in the main toolbar. You should see your two ClustalW runs, and they are probably already in the Done state. If not, wait until both jobs are "Done". Now select the two jobs and press "Fetch" to load the results into your workspace. Click the Alignments button in the main toolbar and you will see a list of all alignments in your current workspace, including the two ClustalW alignments for LSU and SSU.

Merge Alignments

Now comes the tricky part. We have to merge the alignments to perform a multi-gene analysis. Select both alignments and click the Merge button in the toolbar. In the dialog that appears you have to check the Use Taxon Names box! Because we fetched the sequences from GenBank, the accession number is used as the sequence name, and the accession numbers, of course, do not match up. For example, AY584667 is the SSU and AY584643 is the LSU; they should appear as one row in the merged alignment. To avoid heavy renaming, we can use the taxonomic information to create a proper merge. Along with the sequence information, we also fetched the taxonomic information from the NCBI Taxonomy Database, so both AY584667 and AY584643 are linked to the same taxon: Acarosporina microspora. Checking Use Taxon Names forces Epos to merge based on the taxonomic information instead of the raw sequence names. Click OK and Epos creates the merged alignment, which will appear in the list of alignments in your workspace.

To get an idea of what happened, double-click the new merged alignment. Notice the black arrows above the alignment. These are annotations on the alignment that mark different regions. In this example, the annotations represent the SSU and LSU regions in the merged alignment.

To perform a multi-gene analysis, use the Run button in the toolbar of the alignment view. You will see a list of algorithms that work on alignments. Select RAxML BS and ML to perform a rapid RAxML analysis with bootstrapping. If RAxML is not yet installed, use the Install Tool button to install it. Select a model, for example GTRCAT, and you are good to go. To verify that you are doing a multi-gene analysis, click on the small right-arrow button above the RAxML configuration parameters. You will see a table of regions used for the analysis. Here you can also choose different models for the regions if you wish. Epos automatically uses all available alignment annotations as regions for RAxML. To start the algorithm, click the Run button in the lower right corner and open the Jobs window after successful submission. When the RAxML run is finished and in the Done state, double-click the job to fetch the results. Click on Trees in the main toolbar to view all the trees in your workspace. The RAxML tree is also in the list.

Merged Alignment with annotations
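To see what this means in RAxML terms: a partitioned RAxML analysis is normally driven by a simple partition file with one line per region. The example below is purely illustrative (the coordinates are hypothetical); Epos derives the actual regions from the alignment annotations, so you do not have to write such a file yourself.

DNA, SSU = 1-1700
DNA, LSU = 1701-3000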

Tutorial: Add Cluster Support

Epos Location Manager

The Epos framework supports the direct integration of compute clusters based on the Sun Grid Engine (SGE). Integrating a remote cluster allows you to submit all jobs directly to the cluster and fetch the results later. For this to work, you need SSH access to a remote machine that is part of the cluster. The machine must be a submit host, and the SGE commands qsub, qdel and qstat must be installed. If you have used your cluster before, this is typically the machine from which you submit your jobs.

Here we will go through the steps of configuring a remote machine to be used as an execution location in Epos. This allows you not only to run single instances of your jobs remotely, removing load from your local machine, but also to submit MPI jobs, e.g. RAxML or MrBayes. You will also be able to shut down Epos on your local machine without interrupting job executions; all the results can be fetched later. Also note that you can use the Epos UI to configure your cluster and then use the configured cluster from within Epos scripts, for example to submit a set of jobs without loading the data into your Epos workspace.

Configure the remote host

Let us quickly go through the steps to configure the remote cluster in Epos. The jobs and location management can be found on the right side of the main toolbar or in the main menu under Jobs. You will see two tabs, one that lists the jobs and one that lists all configured locations. Click on the Locations tab and then on the plus button in the lower left corner.

This opens the "New Location" wizard. You have to fill out the form to integrate your cluster. Specify a unique name that identifies your cluster, the remote host address and SSH port, your user name and password, and two folders: the Remote Folder is the location where Epos stores remote job data, and the Installation Folder is the directory used to install executables and tools on the cluster. Finally, also check the update interval and the cluster architecture.

SGE Tools

After you click Next, Epos connects to the remote machine and tries to figure out the locations of the SGE tools. There is a good chance that Epos can automatically detect the right locations, but in case one of the paths cannot be identified correctly, you will have to specify it manually. These paths are the most crucial part of the cluster configuration: they point to the executables Epos uses when submitting or deleting jobs or checking their status. If the automatic detection did not work, you can also start by specifying just the SGE_ROOT location and hitting Validate. Epos will then try to figure out the paths to the commands based on the specified SGE_ROOT.
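As a rough orientation, the SGE commands usually live in an architecture-specific directory below SGE_ROOT. The paths below are illustrative only; the architecture directory name depends on your SGE version and platform:

SGE_ROOT=/opt/sge
/opt/sge/bin/lx24-amd64/qsub
/opt/sge/bin/lx24-amd64/qstat
/opt/sge/bin/lx24-amd64/qdel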

The next section allows you to configure the locations of remote tools. For example, to configure ClustalW, open the ClustalW tab and specify the full path to the ClustalW executable. You can also use the Auto-Install function for one or all tools, but note that typically you want to specify paths to installations that already exist on the cluster.

Tool Configuration


Note that the latter is especially important if you want to submit MPI jobs, e.g. RAxML or MrBayes. Specify the path to the cluster's MPI version of these tools, because the Auto-Install feature does not provide MPI builds. The last tool in the list is Epos itself. You need a copy of Epos on the cluster to submit algorithms that are not based on external tools, e.g. the Neighbor Joining implementation.

Finally, you have to adapt the SGE execution scripts. Three scripts are available by default: Epos, External Tool and MPI. The first script is used to submit algorithms that run in Epos and do not use an external tool. In contrast, the External Tool script runs jobs based on external tools, while the MPI script is used to submit MPI jobs. You can use SGE parameters within the script template. For example, if you want to specify a specific queue, say all.q, to be used for your job, add

#$ -q all.q

in the script. This is even more important for MPI jobs: you probably have to make sure that a proper version of mpirun is used and that the MPI libraries can be found. When you use SGE's tight integration, you probably also want to specify

#$ -pe mpi 24

where mpi is the name of your parallel environment and 24 is the number of slots you are requesting.
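Putting these pieces together, an MPI submit script could contain something along the lines of the sketch below. This is only an illustration using standard SGE directives; the actual script templates in Epos may differ, the tool path is hypothetical, and Epos fills in the real command line for each job.

#!/bin/bash
#$ -q all.q
#$ -pe mpi 24
#$ -cwd
# $NSLOTS is set by SGE to the number of granted slots
mpirun -np $NSLOTS /path/to/raxmlHPC-MPI ...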

MPI SGE Configuration


When everything is configured, hit Finish and your remote location is created. Now you can start algorithms and change the location you want to submit jobs to in the lower right corner of the algorithm window. If you select a remote location, the submission dialog will ask you for a name for the remote job and let you select and customize the SGE script you want to use for execution.

Tutorial: NAD5

This is a mini-tutorial based on an example from “Gene und Stammbaeume” by Volker Knoop and Kai Mueller.

We will use Epos to create a phylogeny of marsupials based on the NAD5 gene. The basic steps involved are: collect initial data from NCBI, search for homologous sequences not covered by the initial search, find outgroup sequences, create an alignment and finally build a tree.

For this demo, we start with an empty Epos workspace. If you want to create a new workspace, use File->New Workspace… and restart Epos.

NCBI Search for nd5 Metatheria

To get things started, we have to search NCBI for sequences linked to nad5 in Metatheria. You can do the search in the NCBI web interface and download the results in GenBank format, but for now we use the integrated NCBI search in Epos. Select File->New->Search GenBank… and enter "nd5 Metatheria" into the search field before you hit enter. At the time of writing, I ended up with 45 complete mitochondrial genome hits. For simplicity, select all the hits (CTRL/CMD-A in the table) and press Fetch Selected. All selected hits will now be imported into your workspace, including the taxonomic information linked to the sequences.

To see the imports and quickly check the taxonomy, select Sequences from the main toolbar. The sequence overview opens and you will see the 45 imported entries. Select all, right-click on the table to get the context menu and choose Show Taxonomy…. A window opens that shows the taxonomic hierarchy of the selected sequence entries. Verify that you have only sequences that are Metatheria.

Check the taxonomy

Right now, we have the complete genome sequences in the database, but we are interested only in the nd5 regions. We can use the Extract Features function to get the DNA sequences annotated as nd5. Again, select all sequences and hit the Extract Features button in the toolbar. The first dialog that appears asks you whether you want to select from all features or just from the ones that are not unique. In our case, we want to see All Features but filter for CDS annotations. Type CDS into the filter field and hit Ok. We now get a list of all CDS regions in all the selected sequences, and you can check the ones we are interested in. Use the search field in the lower right corner to filter for ND5 regions and click the checkboxes for the features you want to extract. Before clicking Ok, make sure you checked the Keep Original Sequence Name box and Feature as Gene name. The first ensures that the extracted sequences get the same name as the source sequence, and the latter assigns the feature name as the gene name of the extracted sequences.

Extract ND5 Features

Back in the sequence view, we have 45 new sequences, but the gene name column is a little bit messed up: not all the annotations were named nd5. To fix this, select the newly extracted sequences, right-click, select Assign->Gene… and assign ND5 to all sequences. By default you find new data at the bottom of the list, so the last 45 sequences are the extracted features.

Before we continue creating alignments and trees, it is time to get a little more organized. To create collections of data, you can use virtual files and folders in the Epos workspace. To create a collection of our extracted sequences, right-click in the virtual file system browser on the left side of the Epos main window (the blue area). Create a new folder and name it ND5 DNA. Now select the extracted sequences and drag-and-drop them into the new folder. The folder icon turns green and the folder now contains your sequences. To view the content, right-click on the folder and select Show Content or use the little toolbar on top of the file system window. Note that you can also drag and drop folders into algorithms.

Okay, so before we create the first alignment, we want to extend our dataset further. The initial text-based search might not have found all homologous sequences, and we also need to find appropriate outgroups. We use BLAST to extend our set of sequences. Select one of the nd5 sequences and hit the Blast button in the toolbar. This loads the sequence into the Blast dialog. We now want to do two BLAST searches, one for Prototheria and one for Eutheria, to find other homologues and outgroup sequences. Type Prototheria into the Taxon search bar to limit the search and add Prototheria to the Result name field before you click the submit button. When the job has been submitted successfully, click on Show Jobs in the dialog that appears. This will show you the list of currently submitted jobs. Before you close the Blast search window, quickly submit another search: set the taxonomic limit to Eutheria and resubmit the job.

Blast Query

When both jobs are in the Done state, select them and click on Fetch Results in the Jobs window. Epos will now download the BLAST results and import them into the workspace. Click on Blast Results in the main toolbar. You will see the two BLAST datasets. Let's start with the homologous hits. Double-click the results for the Prototheria search and confirm the next dialog with Ok; you do not have to customize the BLAST result import for this example. The result view opens and you should see a few hits for the query sequence. I found:

Tachyglossus aculeatus complete mitochondrial genome

Zaglossus bruijni complete mitochondrial genome

Ornithorhynchus anatinus mitochondrial DNA, complete genome

Select the hits and click the Download button in the lower toolbar. This will fetch the associated records from GenBank. When Epos asks you what to do with the downloaded results, click on Import to import them into the current workspace. As the hits are also complete genome sequences, you have to repeat the feature extraction steps: select the sequences, choose Extract Features in the toolbar, filter for CDS and search for the nadh5 features. Note that the hits might not be well annotated; I found the three ND5 regions by filtering the table for "5" and checking the "Description" column. You can also hide table columns by right-clicking the table header and removing the columns you are not interested in. Enable Keep Original Sequence Name and extract. Now assign ND5 gene names to all the newly extracted sequences and drag and drop them into the ND5 DNA folder. You can switch to the sequence view using the Sequences button in the main toolbar.

Blast Results for the Outgroup Selection

To select a few outgroup sequences, open the results for the BLAST search restricted to Eutheria. There is a good chance you will not have hit the same sequences as in this example; I just randomly picked three sequences. These are our outgroup sequences:

Lemur catta

Talpa europaea

Mus terricolor

Again, these are all complete mitochondrial genomes, so we do the extraction step again. After you have extracted the CDS regions and dropped them into the ND5 DNA folder, we can create the first alignment and an initial tree.

To create the alignment, open the ND5 DNA folder content by selecting the folder and clicking the small sequence button in the toolbar on top, or right-click the folder and select Show Content->Sequences. You will see the content of the folder. Select all the sequences, right-click and select Run. You will see a list of algorithms that can be applied to a set of sequences. Because we want to compute a multiple sequence alignment, select one of the alignment methods; for a fast alignment I chose Mafft. All the sequences are set as input and you can keep the parameters at their default values for now. Epos will warn you if Mafft is not yet installed. If that is the case, click on the Install Tool button and use the auto-installation feature for Mafft. Now the Run button on the bottom right of the toolbar should be enabled and you can click it to start the computation. Open the Jobs window and wait until the job state is "Done" before you double-click the job to fetch the results.

ND5 DNA Alignment

After the results are fetched and imported into the workspace, you can open the Alignments view to get a list of all available alignments. The one we just computed should appear at the bottom of the list. For now, we use the alignment as is: just right-click it and select Run again to compute a tree.

To get a quick result, choose NJ Tree to compute a neighbor joining tree from the alignment. When the NJ window opens, set the rooting method to "Outgroup" and select one of your outgroup sequences. You can go back to the sequence view to check for the right name. Now start the computation; it should take just a few seconds. Fetch the results and go to the Trees view using the main toolbar. Like all overviews in Epos, the Trees view shows a list of all the trees in the current workspace, and the computed NJ tree will be last in the list. Double-click the tree to go to the detail view.

Note that the tree now shows the sequence names as labels. However, to get a better idea of what is where in the tree, you can switch to a taxonomic labeling. Because you imported all your sequences from GenBank, they all come with taxonomic information, which is also imported into the Epos workspace. On the left side of the tree view, in the Nodes tab, turn on Taxon Names and the taxonomic naming will be used.

The first ND5 DNA Tree

Epos 0.9

Finally 🙂

Epos 0.9 SNAPSHOT Release

Okay, here is the beta release for version 0.9: new user interface, new modules, integrated BLAST viewer and taxonomy support. Please keep us informed about anything that is not working properly … and note that we are still working heavily on the tree comparator and taxon modules.

Feel free to use our bug tracker to submit bugs or feature requests. You can create your own account to stay informed about fixes for submitted bugs.

Hide Table Columns easily

columnHider

When you deal with large tables that provide a good number of columns, it is always nice if you can hide some of them. They should still be available in the TableModel, so this is clearly a task for the view.

It turns out that realizing a popup menu on the table header is quite easy. We utilize the table's ColumnModel to change the visible columns. Well, actually we do not just change a column's visibility but remove it from, or re-add it to, the column model. In Epos, all this can be done using a simple utility method in ComponentUtils. It creates a popup and adds the appropriate mouse listeners to the table's header.

Here is the utility code that creates the popup and adds the listener. Note that we evaluate and check for a popup click on both the pressed and the released mouse event; this is due to some platform-specific differences.
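The actual ComponentUtils method is not reproduced here, so the following is only a minimal sketch of the same idea; the class and method names (ColumnHiderSketch, installColumnHider) are invented for this example and the real Epos implementation may differ.

import javax.swing.JCheckBoxMenuItem;
import javax.swing.JPopupMenu;
import javax.swing.JTable;
import javax.swing.table.TableColumn;
import java.awt.event.MouseAdapter;
import java.awt.event.MouseEvent;
import java.util.ArrayList;
import java.util.List;

public final class ColumnHiderSketch {

  /** Attach a popup menu to the table header that toggles column visibility. */
  public static void installColumnHider(final JTable table) {
    // remember all columns so hidden ones can be re-added later
    final List<TableColumn> allColumns = new ArrayList<TableColumn>();
    for (int i = 0; i < table.getColumnModel().getColumnCount(); i++) {
      allColumns.add(table.getColumnModel().getColumn(i));
    }
    table.getTableHeader().addMouseListener(new MouseAdapter() {
      // check both events; the popup trigger differs between platforms
      @Override public void mousePressed(MouseEvent e) { maybeShow(e); }
      @Override public void mouseReleased(MouseEvent e) { maybeShow(e); }

      private void maybeShow(MouseEvent e) {
        if (!e.isPopupTrigger()) return;
        JPopupMenu popup = new JPopupMenu();
        for (final TableColumn column : allColumns) {
          final JCheckBoxMenuItem item = new JCheckBoxMenuItem(
              String.valueOf(column.getHeaderValue()), isVisible(table, column));
          item.addActionListener(ev -> {
            if (item.isSelected()) {
              // re-add the column (it is appended at the end of the model)
              table.getColumnModel().addColumn(column);
            } else {
              // hide the column by removing it from the column model only;
              // the data stays untouched in the TableModel
              table.getColumnModel().removeColumn(column);
            }
          });
          popup.add(item);
        }
        popup.show(e.getComponent(), e.getX(), e.getY());
      }
    });
  }

  private static boolean isVisible(JTable table, TableColumn column) {
    for (int i = 0; i < table.getColumnModel().getColumnCount(); i++) {
      if (table.getColumnModel().getColumn(i) == column) return true;
    }
    return false;
  }
}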


GlazedLists and column-based filtering

While I was working on the BlastViewer, I thought filtering by table column would be a nice feature to get rid of all the cr**y results you are not interested in. Most of the tables we create in Epos are based on GlazedLists as table model and data container. GlazedLists comes with pretty nice filtering features out of the box. Filtering for some strings is straightforward, but here I wanted something slightly more complicated: I want to filter a user-specified column using some comparator and a value. Another important thing was that the filter should be easy to use, without any configuration. Here is what we do for the BlastResult table:


...
// create the table; filteredHits is the source EventList of blast hits
final AdvancedTableFormat hitFormat = new BlastHitFormat();
TableFilter tableFilter = new TableFilter(hitFormat);
FilterList columnFilteredHits = new FilterList(filteredHits, tableFilter);
JTable hitTable = new JTable(new EventTableModel(columnFilteredHits, hitFormat));
...
// create the filter view for the column filter
hitTableFilterView = new TableFilterView(tableFilter);

Basically, we create a new TableFilter. The TableFilter implements a GlazedLists MatcherEditor, so we can apply it to a FilterList. The FilterList serves as the container for the final table. To create the UI, we use our TableFilterView. The view starts as an empty component, but you can call the addFilter() method to create a new filter.

Note that the current TableFilter works only on AdvancedTableFormat. This is due to the fact that the filter needs to know the column's data type, so your AdvancedTableFormat implementation must return valid types.
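For illustration, here is a minimal sketch of an AdvancedTableFormat that returns proper column classes; the Hit bean and the HitTableFormat class are invented for this example and are not the actual Epos BlastHitFormat.

import java.util.Comparator;
import ca.odell.glazedlists.GlazedLists;
import ca.odell.glazedlists.gui.AdvancedTableFormat;

// hypothetical hit bean; the real blast hit class in Epos may look different
class Hit {
  String subject; double eValue; int score;
  Hit(String subject, double eValue, int score) {
    this.subject = subject; this.eValue = eValue; this.score = score;
  }
}

class HitTableFormat implements AdvancedTableFormat<Hit> {
  public int getColumnCount() { return 3; }

  public String getColumnName(int column) {
    switch (column) {
      case 0: return "Subject";
      case 1: return "E-Value";
      default: return "Score";
    }
  }

  public Object getColumnValue(Hit hit, int column) {
    switch (column) {
      case 0: return hit.subject;
      case 1: return hit.eValue;
      default: return hit.score;
    }
  }

  // the column filter relies on these classes to pick a matching comparator
  public Class getColumnClass(int column) {
    switch (column) {
      case 0: return String.class;
      case 1: return Double.class;
      default: return Integer.class;
    }
  }

  public Comparator getColumnComparator(int column) {
    return GlazedLists.comparableComparator();
  }
}

With valid column classes in place, the filter can offer type-appropriate comparisons, for example numeric comparisons on the E-Value column.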

The TableFilter implementation has no dependencies and can be used with any GlazedLists table. The UI currently has some dependencies on the Epos component factory and it uses JGoodies FormLayout, though these two can easily be removed with some minor tweaks in the view classes.

If you want to take a look at the implementation, check the SVN. I thought about putting a jar file here, but this is still work in progress and you cannot use the jar anyway, as it does not contain the dependencies.

If you are interested or you have any questions, just contact me.

Epos Blast Viewer

blastviewer

We just put the Epos Blast Viewer online as a standalone Web Start application on top of the Epos framework. The app is a small tool that reads NCBI BLAST results from XML files and displays the data. You can directly download the sequences and store them on disk, and you can export the results as default BLAST output or as a CSV file.

IntelliJ class cloud

I was bored and played around in IntelliJ, which is open source now. I found the analyze class cloud button and here is the result for the Epos sources.
epos-cloud

Looks like ComponentUtils is an important class 🙂