Short usage introduction to MIRA3

Bastien Chevreux

MIRA Version 3.4.1.1

Document revision $Id$

Table of Contents

1. Important notes
2. Quick start for the impatient
2.1. Estimating memory needs
2.2. Preparing and starting an assembly from scratch with FASTA files
2.2.1. With data pre-clipped or pre-screened for vector sequence
2.2.2. Using SSAHA2 / SMALT to screen for vector sequence
3. Calling mira from the command line
4. Using multiple processors
5. Usage examples
5.1. Assembly from scratch with GAP4 and EXP files
5.2. Reassembly of GAP4 edited projects
5.3. Using backbones to perform a mapping assembly against a reference sequence
6. Troubleshooting
6.1. caf2gap cannot convert the result of a large assembly?
6.2. Reverse GenBank features are in forward direction in a gap4 project
 

Just when you think it's finally settled, it isn't.

 
 --Solomon Short

This guide assumes that you have basic working knowledge of Unix systems, know the basic principles of sequencing (and sequence assembly) and know what assemblers do. You should also read through the main documentation of the assembler, as this is really just a getting-started guide.

1.  Important notes

For working parameter settings for assemblies involving 454 and / or Solexa data, please also read the MIRA help files dedicated to these platforms.

2.  Quick start for the impatient

This example assumes that you have a few sequences in FASTA format that may or may not have been preprocessed, that is, where sequencing vector has been cut back or masked out. If quality values are also present in a FASTA-like format, so much the better.

We need to give a name to our project: throughout this example, we will assume that the sequences we are working with are from Bacillus chocorafoliensis (Bchoc for short), a well-known, chocolate-adoring bug from the Bacillus family which is able to make a couple of hundred grams of chocolate vanish in just a few minutes.

Our project will therefore be named 'bchoc'.

2.1.  Estimating memory needs

"Do I have enough memory?" used to be one of the most frequently asked questions. To answer it, please use miramem, which will give you an estimate. Basically, you just need to start the program and answer the questions; for more information, please refer to the corresponding section in the main MIRA documentation.

Take this estimate with a grain of salt: depending on the properties of the sequences, the actual requirement can deviate from the estimate by +/- 30%.
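
To get a feeling for what a +/- 30% variation means in practice, the following minimal sketch turns an estimate into a lower and an upper bound. The function name mem_range and the use of whole MiB are assumptions made for this illustration only; miramem itself does not provide such a helper.

```shell
# Print the lower and upper bound of a memory estimate, assuming the
# +/- 30% variation mentioned above. The estimate is given in whole MiB.
# mem_range is a hypothetical helper, not part of the MIRA package.
mem_range() {
    estimate_mib=$1
    low=$(( estimate_mib * 70 / 100 ))
    high=$(( estimate_mib * 130 / 100 ))
    echo "$low $high"
}

mem_range 8000   # prints "5600 10400"
```

If the upper bound is close to the physical memory of your machine, it is probably wise to plan for a larger machine.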

2.2.  Preparing and starting an assembly from scratch with FASTA files

2.2.1.  With data pre-clipped or pre-screened for vector sequence

The following steps let you quickly start a simple assembly if your sequencing provider gave you data which was pre-clipped or pre-screened for vector sequence:

$ mkdir bchoc_assembly1
$ cd bchoc_assembly1
bchoc_assembly1$ cp /your/path/sequences.fasta bchoc_in.sanger.fasta
bchoc_assembly1$ cp /your/path/qualities.someextension bchoc_in.sanger.fasta.qual
bchoc_assembly1$ mira --project=bchoc --job=denovo,genome,accurate,sanger --fasta

Explanation: we created a directory for the assembly, copied the sequences into it (to make things easier for us, we named the file directly in a format suitable for mira to load it automatically) and we also copied quality values for the sequences into the same directory. As the last step, we started mira with options telling it that

  • our project is named 'bchoc' and hence, input and output files will have this as prefix;

  • the data is in a FASTA formatted file;

  • the data should be assembled de-novo as a genome at an assembly quality level of accurate and that the reads we are assembling were generated with Sanger technology.

By giving mira the project name 'bchoc' (--project=bchoc) and naming the sequence file with the appropriate extension _in.sanger.fasta, we made mira load that file automatically for the assembly. When additional quality values are available (bchoc_in.sanger.fasta.qual), these are also automatically loaded and used for the assembly.

Note: If there is no file with quality values available, MIRA will stop immediately. You will need to provide parameters to the command line which explicitly switch off loading and using quality files.

Warning: Not using quality values is NOT recommended. Read the corresponding section in the MIRA reference manual.
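
Because mira stops when the quality file is missing, it can be convenient to check for both input files before starting a run. The following is a minimal sketch under the naming scheme described above; the function name check_sanger_inputs is an assumption made for this example and is not part of MIRA.

```shell
# Check that the two files mira looks for under the naming scheme
# described above exist in the current directory.
# Returns 0 if both are present, 1 otherwise.
# check_sanger_inputs is a hypothetical helper, not part of the MIRA package.
check_sanger_inputs() {
    project=$1
    status=0
    for f in "${project}_in.sanger.fasta" "${project}_in.sanger.fasta.qual"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Possible usage (not executed here):
#   check_sanger_inputs bchoc && \
#     mira --project=bchoc --job=denovo,genome,accurate,sanger --fasta
```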

2.2.2.  Using SSAHA2 / SMALT to screen for vector sequence

If your sequencing provider gave you data which was NOT pre-clipped for vector sequence, you can do this yourself in a pretty robust manner using SSAHA2 -- or the successor, SMALT -- from the Sanger Centre. You just need to know which sequencing vector the provider used and have its sequence in FASTA format (ask your provider).

Note that this screening is a valid method for any type of Sanger sequencing vectors, 454 adaptors, Solexa adaptors and paired-end adaptors etc.

For SSAHA2 follow these steps (most are the same as in the example above):

$ mkdir bchoc_assembly1
$ cd bchoc_assembly1
bchoc_assembly1$ cp /your/path/sequences.fasta bchoc_in.sanger.fasta
bchoc_assembly1$ cp /your/path/qualities.someextension bchoc_in.sanger.fasta.qual
bchoc_assembly1$ ssaha2 -output ssaha2 \
  -kmer 8 -skip 1 -seeds 1 -score 12 -cmatch 9 -ckmer 6 \
  /path/where/the/vector/data/resides/vector.fasta \
  bchoc_in.sanger.fasta > bchoc_ssaha2vectorscreen_in.txt
bchoc_assembly1$ mira -project=bchoc -job=denovo,genome,accurate,sanger -fasta SANGER_SETTINGS -CL:msvs=yes

Explanation: there are just two differences from the example above:

  • calling SSAHA2 to generate a file which records where the vector sequence matches your sequences;

  • telling mira with SANGER_SETTINGS -CL:msvs=yes to load this vector screening data for Sanger data.

For SMALT, the only difference is that you use SMALT for generating the vector-screen file and ask SMALT to generate it in SSAHA2 format. As SMALT works in two steps (indexing and then mapping), you also need to perform it in two steps and then call MIRA. E.g.:

bchoc_assembly1$ smalt index -k 7 -s 1 smaltidxdb /path/where/the/vector/data/resides/vector.fasta
bchoc_assembly1$ smalt map -f ssaha -d -1 -m 7 smaltidxdb bchoc_in.sanger.fasta  > bchoc_smaltvectorscreen_in.txt
bchoc_assembly1$ mira -project=bchoc -job=denovo,genome,accurate,sanger -fasta SANGER_SETTINGS -CL:msvs=yes

Note: Due to subtle differences between the output of SSAHA2 (in ssaha2 format) and SMALT (in ssaha2 format), MIRA identifies the source of the screening (and the parsing method it needs) by the name of the screen file. Therefore, screens done with SSAHA2 need the postfix *_ssaha2vectorscreen_in.txt in the file name, and screens done with SMALT need *_smaltvectorscreen_in.txt.
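
Since the file name decides how the screen is parsed, it may help to derive the name from the tool instead of typing it by hand. The sketch below only encodes the two postfixes named in the note above; the function name screen_file_name is an assumption made for this example.

```shell
# Build the vector-screen file name for a given project and screening tool.
# Only the two postfixes named in the note above are handled.
# screen_file_name is a hypothetical helper, not part of the MIRA package.
screen_file_name() {
    project=$1
    tool=$2
    case "$tool" in
        ssaha2) echo "${project}_ssaha2vectorscreen_in.txt" ;;
        smalt)  echo "${project}_smaltvectorscreen_in.txt" ;;
        *)      echo "unknown screening tool: $tool" >&2; return 1 ;;
    esac
}

screen_file_name bchoc smalt   # prints "bchoc_smaltvectorscreen_in.txt"
```

One could then redirect the output of the screening tool to "$(screen_file_name bchoc smalt)" rather than to a hand-typed name.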

3.  Calling mira from the command line

mira can be used in many different ways: building assemblies from scratch, performing reassembly on existing projects, assembling sequences from closely related strains, assembling sequences against an existing backbone (mapping assembly), etc. mira comes with a number of quick switches, i.e., switches that turn on parameter combinations which should suit most needs.

E.g.: mira --project=foobar --job=sanger --fasta --highlyrepetitive

The line above tells mira that our project has the general name foobar and that the sequences are to be loaded from FASTA files, the sequence input file being named foobar_in.sanger.fasta (and the sequence quality file, if available, foobar_in.sanger.fasta.qual). The reads come from Sanger technology and mira is prepared for the genome containing nasty repeats. The result files will be in a directory named foobar_results; statistics about the assembly will be available in the foobar_info directory, e.g., a summary of contig statistics in foobar_info/foobar_info_contigstats.txt. Notice that the --job= switch is missing some specifications; mira will automatically fill in the remaining defaults (i.e., denovo,genome,accurate in the example above).

E.g.: mira --project=foobar --job=mapping,accurate,sanger --fasta --highlyrepetitive

This is the same as the previous example, except that mira will perform a mapping assembly in 'accurate' quality of the sequences against one or more backbone sequences. mira will therefore additionally load the backbone sequence(s) from the file foobar_backbone_in.fasta (FASTA being the default type of backbone sequence to be loaded) and, if existing, quality values for the backbone from foobar_backbone_in.fasta.qual.

E.g.: mira --project=foobar --job=mapping,accurate,sanger --fasta --highlyrepetitive -SB:bft=gbf

As above, except we have added an extensive switch ([-SB:bft]) to tell mira that the backbones are in a GenBank format file (GBF). MIRA will therefore load the backbone sequence(s) from the file foobar_backbone_in.gbf. Note that the GBF file can also contain multiple entries, i.e., it can be a GBFF file.
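
The backbone file names in the two mapping examples above follow a simple pattern. The sketch below spells it out for the two backbone types whose file names this guide states explicitly (fasta and gbf); the function name expected_backbone_file is an assumption made for this example, and other backbone types are deliberately rejected rather than guessed.

```shell
# Print the backbone file name mira is described to load for a project,
# given the value passed to -SB:bft (default: fasta).
# Only the two cases spelled out in the text above are handled.
# expected_backbone_file is a hypothetical helper, not part of MIRA.
expected_backbone_file() {
    project=$1
    bft=${2:-fasta}
    case "$bft" in
        fasta|gbf) echo "${project}_backbone_in.${bft}" ;;
        *)         echo "file name for backbone type '$bft' not covered here" >&2
                   return 1 ;;
    esac
}

expected_backbone_file foobar gbf   # prints "foobar_backbone_in.gbf"
```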

E.g.: mira --project=foobar --job=mapping,accurate,sanger --fastq --highlyrepetitive -SB:bft=gbf

As above, except we have changed the input type for all files from FASTA to FASTQ.

4.  Using multiple processors

This feature is in its infancy: presently only the SKIM algorithm uses multiple threads. The number of threads for this stage can be set via the [-GE:not] parameter, e.g., -GE:not=4 to use 4 threads.
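
Instead of hard-coding the thread count, one could derive it from the number of online CPUs. The following is a minimal sketch; the function name mira_thread_option is an assumption made for this example, and getconf _NPROCESSORS_ONLN is widely available but not guaranteed on every system, hence the fallback to a single thread.

```shell
# Build the -GE:not option from the number of online CPUs.
# Falls back to one thread if the CPU count cannot be determined.
# mira_thread_option is a hypothetical helper, not part of the MIRA package.
mira_thread_option() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    case "$n" in
        ''|*[!0-9]*) n=1 ;;   # empty or non-numeric answer
    esac
    echo "-GE:not=$n"
}

# Possible usage (not executed here):
#   mira --project=foobar --job=sanger --fasta $(mira_thread_option)
```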

5.  Usage examples

5.1.  Assembly from scratch with GAP4 and EXP files

A simple GAP4 project will do nicely. Please take care of the following: you need experiment / FASTA / PHD files that are already preprocessed, i.e., at least the sequencing vector should have been tagged (in EXP files) or masked out (FASTA or PHD files). Ideally, some reasonably strict quality clipping should also have been done for the EXP files; pregap4 should do this for you.

  1. Step 1: Create a file of filenames (named mira_in.fofn) for the project you wish to assemble. The file of filenames should contain the newline separated names of the EXP-files and nothing else.

  2. Step 2: Execute the mira assembly, optionally using command line options or output redirection:

    $ /path/to/the/mira/package/mira ... other options ...

    or simply

    $ mira ... other options ...

    if MIRA is in a directory which is in your PATH. The result of the assembly will now be in a directory named mira_results, where you will find mira_out.caf, mira_out.html etc., or in gap4 direct assembly format in the mira_out.gap4da sub-directory.

  3. Step 3a: (This is not recommended anymore) Change to the gap4da directory and start gap4:

    $ cd mira_results/mira_out.gap4da
    mira_results/mira_out.gap4da$ gap4

    choose the menu 'File->New' and enter a name for your new database (like 'demo'). Then choose the menu 'Assembly->Directed assembly'. Enter the text 'fofn' in the entry labelled Input readings from List or file name and enter the text 'failures' into the entry labelled Save failures to List or file name. Press "OK".

    That's it.

  4. Step 3b: (Recommended) As an alternative to step 3a, one can use the caf2gap converter (see below)

    mira_results$ caf2gap -project demo -version 0 -ace mira_out.caf
    mira_results$ gap4 DEMO.0
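
Step 1 above can be sketched as follows. The ".exp" suffix is an assumption made for this example (this guide does not say how EXP files are named), and make_fofn is a hypothetical helper name; adjust the pattern to whatever your files are called.

```shell
# Write the names of all EXP files in the current directory to
# mira_in.fofn, one name per line and nothing else.
# Assumption: EXP files end in ".exp".
# make_fofn is a hypothetical helper, not part of the MIRA package.
make_fofn() {
    : > mira_in.fofn                 # create or truncate the file
    for f in *.exp; do
        # skip the unexpanded pattern if no file matches
        [ -f "$f" ] && echo "$f" >> mira_in.fofn
    done
    return 0
}
```

Running make_fofn in the directory that holds the EXP files, followed by step 2, should be enough to get started.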

Out-of-the-box example.  MIRA comes with a few really small toy projects to test usability on a given system. Go to the minidemo directory and follow the instructions given in the section for own projects above, but start with step 2. You might also want to start mira while redirecting the output to a file for later analysis.

5.2.  Reassembly of GAP4 edited projects

It is sometimes desirable to reassemble a project that has already been edited, for example when hidden data in reads has been uncovered or when some repetitive bases have been tagged manually. The canonical way to do this is by using CAF files as the data exchange format and the caf2gap and gap2caf converters available from the Sanger Centre (http://www.sanger.ac.uk/Software/formats/CAF/).

Warning: The project will be completely reassembled. Contig joins or breaks that have been made in the GAP4 database will be lost; you will get an entirely new assembly with what mira determines to be the best assembly.

  • Step 1: Assuming that the assembly is in the GAP4 database CURRENT.0, convert it with the gap2caf tool:

    $ gap2caf -project CURRENT -version 0 -ace > newstart_in.caf

    The name "newstart" will be the project name of the new assembly project.

  • Step 2: Start mira with the -caf option and tell it the name of your new reassembly project:

    $ mira -caf=newstart

    (and other options like --job etc. at will.)

  • Step 3: Convert the resulting CAF file newstart_assembly/newstart_d_results/newstart_out.caf to a gap4 database format as explained above and start gap4 with the new database:

    $ cd newstart_assembly/newstart_d_results
    newstart_assembly/newstart_d_results$ caf2gap -project reassembled -version 0 -ace newstart_out.caf
    newstart_assembly/newstart_d_results$ gap4 REASSEMBLED.0

5.3.  Using backbones to perform a mapping assembly against a reference sequence

One useful feature of mira is the ability to assemble against already existing reference sequences or contigs (also called a mapping assembly). The parameters that control the behaviour of the assembly in these cases are in the [-STRAIN/BACKBONE] section of the parameters.

Please have a look at the example in the minidemo/bbdemo2 directory which maps sequences from C.jejuni RM1221 against (parts of) the genome of C.jejuni NCTC11168.

There are a few things to consider when using backbone sequences:

  1. Backbone sequences can be as long as needed! They are not subject to normal read length constraints of a maximum of 10k bases. That is, if one wants to load one or several entire chromosomes of a bacterium or lower eukaryote as backbone sequence(s), this is just fine.

  2. Backbone sequences can be single sequences as provided by, e.g., FASTA, FASTQ or GenBank files. But backbone sequences can also be whole assemblies when they are provided in, e.g., CAF format. This opens the possibility of performing semi-hybrid assemblies by first assembling reads from one sequencing technology de-novo (e.g. 454) and then mapping reads from another sequencing technology (e.g. Solexa) to the whole 454 alignment instead of mapping them to the 454 consensus.

    A semi-hybrid assembly will therefore contain, like a hybrid assembly, the reads of both sequencing technologies.

  3. Backbone sequences will not be reversed! They will always appear in forward direction in the output of the assembly. Please note: if the backbone sequences come from a CAF file containing contigs with reversed reads, then the contigs themselves will be in forward direction, but the reads they contain that are in reverse complement direction will of course stay in reverse complement direction.

  4. Backbone sequences will not be assembled together! That is, if a sequence of the backbones has a perfect overlap with another backbone sequence, they will still not be merged.

  5. Reads are assembled to backbones in a first come, first served scattering strategy.

    Suppose you have two identical backbones and one read which would match both, then the read would be mapped to the first backbone. If you had two (almost) identical reads, the first read would go to the first backbone, the second read to the second backbone. With three almost identical reads, the first backbone would get two reads, the second backbone one read.

  6. Only in backbones loaded from CAF files: contigs made out of single reads (singlets) lose their status as backbones and will be returned to the normal read pool for the assembly process. That is, these sequences will be assembled to other backbones or with each other.
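
The first come, first served scattering from point 5 can be illustrated with a small simulation. This is only one reading of the behaviour described there (identical candidates are served in turn), not a description of how mira is implemented; the function name scatter_reads and the read and backbone labels are made up for this example.

```shell
# Assign each read name given as argument to one of N identical backbones
# in turn (backbone 1, 2, ..., N, 1, ...). This merely mimics the outcome
# described in point 5; it is not mira's actual algorithm.
# scatter_reads is a hypothetical helper, not part of the MIRA package.
scatter_reads() {
    n_backbones=$1
    shift
    i=0
    for r in "$@"; do
        echo "$r -> backbone$(( i % n_backbones + 1 ))"
        i=$(( i + 1 ))
    done
}

scatter_reads 2 read1 read2 read3
# prints:
#   read1 -> backbone1
#   read2 -> backbone2
#   read3 -> backbone1
```

With three almost identical reads and two identical backbones, the first backbone ends up with two reads and the second with one, as stated in point 5.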

Examples for using backbone sequences:

  • Example 1: assume you have a genome of an existing organism. From that, a mutant has been made by mutagenesis and you are skimming the genome in shotgun mode for mutations. You would generate for this a straindata file that gives the name of the mutant strain to the newly sequenced reads and simply assemble those against your existing genome, using the following parameters:

    -SB:lsd=yes:lb=yes:bsn=nameOriginalStrain:bft=caf|fasta|gbf

    When loading backbones from CAF, the qualities of the consensus bases will be calculated by mira according to normal consensus computing rules. When loading backbones from FASTA or GBF, one can set the expected overall quality of the sequences (e.g. 1 error in 1000 bases = quality of 30) with [-SB:bbq=30]. It is recommended to have the backbone quality at least as high as the [-CO:mgqrt] value, so that mira can automatically detect and report SNPs.

  • Example 2: suppose that you are in the middle of a shotgun sequencing project and you want to determine the moment when you have enough reads. One could make a complete assembly each day when new sequences arrive. However, starting with genomes the size of a lower eukaryote, this may become prohibitive from the computational point of view. A quick and efficient way to resolve this problem is to use the CAF file of the previous assembly as backbone and simply add the new reads to the pool. The number of singlets remaining after the assembly versus the total number of reads of the project is a good measure for the coverage of the project.

  • Example 3: in EST assembly with miraSearchESTSNPs, existing cDNA sequences can also be useful when added to the project during step 3 (in the file step3_in.par). They provide a framework against which mRNA-contigs built in previous steps will be assembled, allowing for a fast evaluation of the results. Additionally, they provide a direction for the assembled sequences so that one does not need to invert single contigs by hand afterwards.
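
Example 1 equates "1 error in 1000 bases" with a quality of 30, which is consistent with the usual Phred-style scale Q = 10 * log10(N) for one error in N bases. Assuming that scale applies, a value for [-SB:bbq] can be derived from an expected error rate as sketched below; the function name error_rate_to_quality is an assumption made for this example.

```shell
# Convert "1 error in N bases" to a Phred-style quality value,
# Q = -10 * log10(1/N) = 10 * log10(N), rounded to the nearest integer.
# Assumption: the quality scale is Phred-like, as the 1-in-1000 = 30
# example in the text suggests.
# error_rate_to_quality is a hypothetical helper, not part of MIRA.
error_rate_to_quality() {
    awk -v n="$1" 'BEGIN { printf "%.0f\n", 10 * log(n) / log(10) }'
}

error_rate_to_quality 1000   # prints 30
```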

6.  Troubleshooting

(To be expanded)

6.1.  caf2gap cannot convert the result of a large assembly?

This can have two causes:

  1. If you work with a 32 bit executable of caf2gap, it might very well be that the converter needs more memory than can be handled by 32 bit. Only solution: switch to a 64 bit executable of caf2gap.

  2. You compiled caf2gap with a caftools version prior to 2.0.1, and caf2gap now crashes with segmentation faults. Simply grab the newest version of the caftools (at least 2.0.2) at ftp://ftp.sanger.ac.uk/pub/PRODUCTION_SOFTWARE/src/ and compile the whole package. caf2gap will be contained therein.

6.2.  Reverse GenBank features are in forward direction in a gap4 project

caf2gap currently (as of version 2.0.2) has a bug that turns all features in reverse direction into forward direction during the conversion from CAF to a gap4 project. There is a fix available; please contact me for further information (until I find time to describe it here).