“Just when you think it's finally settled, it isn't.”
--Solomon Short
This guide assumes that you have a basic working knowledge of Unix systems, know the basic principles of sequencing (and sequence assembly), and know what assemblers do. It is also advisable to read through the main documentation of the assembler, as this is really just a getting-started guide.
For working parameter settings for assemblies involving 454 and / or Solexa data, please also read the MIRA help files dedicated to these platforms.
This example assumes that you have a few sequences in FASTA format that may or may not have been preprocessed, that is, where the sequencing vector has been cut back or masked out. If quality values are also present in a FASTA-like format, so much the better.
We need to give a name to our project: throughout this example, we will assume that the sequences we are working with are from Bacillus chocorafoliensis (or short: Bchoc); a well known, chocolate-adoring bug from the Bacillus family which is able to make a couple of hundred grams of chocolate vanish in just a few minutes.
Our project will therefore be named 'bchoc'.
"Do I have enough memory?" used to be one of the most frequently asked questions. To answer it, please use miramem, which will give you an estimate. Basically, you just need to start the program and answer the questions; for more information, please refer to the corresponding section in the main MIRA documentation.
Take this estimate with a grain of salt: depending on the properties of the sequences, the actual requirement can deviate from the estimate by +/- 30%.
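To make the +/- 30% margin concrete, the following minimal sketch prints the lower and upper bound around an estimate. The helper name mem_range and the choice of MiB as unit are assumptions made for this illustration; miramem itself does not provide such a function.

```shell
# Hypothetical helper (not part of MIRA): print the -30% and +30% bounds
# around a miramem estimate. Argument: the estimate as a whole number,
# e.g. in MiB.
mem_range() {
    estimate=$1
    low=$(( estimate * 70 / 100 ))
    high=$(( estimate * 130 / 100 ))
    echo "$low $high"
}

# Example: an estimate of 8000 MiB
mem_range 8000
```

If the upper bound does not fit into the memory of your machine, it is probably safer to plan for a larger machine than to rely on the estimate being too pessimistic.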
The following steps will allow you to quickly start a simple assembly if your sequencing provider gave you data which was pre-clipped or pre-screened for vector sequence:
$ mkdir bchoc_assembly1
$ cd bchoc_assembly1
bchoc_assembly1$ cp /your/path/sequences.fasta bchoc_in.sanger.fasta
bchoc_assembly1$ cp /your/path/qualities.someextension bchoc_in.sanger.fasta.qual
bchoc_assembly1$ mira --project=bchoc --job=denovo,genome,accurate,sanger --fasta
Explanation: we created a directory for the assembly, copied the sequences into it (to make things easier for us, we named the file directly in a format suitable for mira to load it automatically) and also copied the quality values for the sequences into the same directory. As the last step, we started mira with options telling it that
our project is named 'bchoc' and hence, input and output files will have this as prefix;
the data is in a FASTA formatted file;
the data should be assembled de-novo as a genome at an assembly quality level of accurate and that the reads we are assembling were generated with Sanger technology.
By giving mira the project name 'bchoc' (--project=bchoc) and naming the sequence file with an appropriate extension (_in.sanger.fasta), mira automatically loaded that file for assembly. When additional quality values are available (bchoc_in.sanger.fasta.qual), these are also automatically loaded and used for the assembly.
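Because the file names decide what gets loaded, it can help to check them before starting a long run. The following is a minimal sketch of such a pre-flight check; the function name check_mira_inputs is hypothetical and not part of MIRA, and only the two file names described above are tested.

```shell
# Hypothetical pre-flight check (not part of MIRA): report whether the
# automatically loaded Sanger input files exist for a given project name.
check_mira_inputs() {
    project=$1
    fasta="${project}_in.sanger.fasta"
    qual="${fasta}.qual"
    if [ ! -f "$fasta" ]; then
        echo "missing sequence file: $fasta"
        return 1
    fi
    if [ ! -f "$qual" ]; then
        echo "missing quality file: $qual"
        return 2
    fi
    echo "ok: $fasta and $qual found"
}
```

Called as check_mira_inputs bchoc inside the assembly directory, it prints which file is missing, if any, so that the problem shows up before mira is started.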
Note: If there is no file with quality values available, MIRA will stop immediately. You will need to provide parameters to the command line which explicitly switch off loading and using quality files.
Warning: Not using quality values is NOT recommended. Read the corresponding section in the MIRA reference manual.
If your sequencing provider gave you data which was NOT pre-clipped for vector sequence, you can do this yourself in a pretty robust manner using SSAHA2 -- or the successor, SMALT -- from the Sanger Centre. You just need to know which sequencing vector the provider used and have its sequence in FASTA format (ask your provider).
Note that this screening is a valid method for any type of Sanger sequencing vectors, 454 adaptors, Solexa adaptors and paired-end adaptors etc.
For SSAHA2 follow these steps (most are the same as in the example above):
$ mkdir bchoc_assembly1
$ cd bchoc_assembly1
bchoc_assembly1$ cp /your/path/sequences.fasta bchoc_in.sanger.fasta
bchoc_assembly1$ cp /your/path/qualities.someextension bchoc_in.sanger.fasta.qual
bchoc_assembly1$ ssaha2 -output ssaha2 -kmer 8 -skip 1 -seeds 1 -score 12 -cmatch 9 -ckmer 6 /path/where/the/vector/data/resides/vector.fasta bchoc_in.sanger.fasta > bchoc_ssaha2vectorscreen_in.txt
bchoc_assembly1$ mira -project=bchoc -job=denovo,genome,accurate,sanger -fasta SANGER_SETTINGS -CL:msvs=yes
Explanation: there are just two differences from the example above:
calling SSAHA2 to generate a file which contains information on the vector sequence hitting your sequences;
telling mira with SANGER_SETTINGS -CL:msvs=yes to load this vector screening data for Sanger data.
For SMALT, the only difference is that you use SMALT for generating the vector-screen file and ask SMALT to generate it in SSAHA2 format. As SMALT works in two steps (indexing and then mapping), you also need to perform it in two steps and then call MIRA. E.g.:
bchoc_assembly1$ smalt index -k 7 -s 1 smaltidxdb /path/where/the/vector/data/resides/vector.fasta
bchoc_assembly1$ smalt map -f ssaha -d -1 -m 7 smaltidxdb bchoc_in.sanger.fasta > bchoc_smaltvectorscreen_in.txt
bchoc_assembly1$ mira -project=bchoc -job=denovo,genome,accurate,sanger -fasta SANGER_SETTINGS -CL:msvs=yes
Note: Due to subtle differences between the output of SSAHA2 (in ssaha2 format) and SMALT (in ssaha2 format), MIRA identifies the source of the screening (and the parsing method it needs) by the name of the screen file. Therefore, screens done with SSAHA2 need to have the postfix *_ssaha2vectorscreen_in.txt in the file name and screens done with SMALT need *_smaltvectorscreen_in.txt.
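Since the file name alone selects the parser, a small helper can keep the two postfixes from being mixed up in scripts. This is only a sketch; the function name screen_file_name and the tool keywords ssaha2 and smalt are assumptions made for the illustration.

```shell
# Hypothetical helper (not part of MIRA): build the vector-screen file name
# that matches the tool which produced the screen.
screen_file_name() {
    project=$1
    tool=$2
    case "$tool" in
        ssaha2) echo "${project}_ssaha2vectorscreen_in.txt" ;;
        smalt)  echo "${project}_smaltvectorscreen_in.txt" ;;
        *)      echo "unknown screening tool: $tool" >&2; return 1 ;;
    esac
}

# Example: name of the screen file for a SMALT screen of project bchoc
screen_file_name bchoc smalt
```

The output of the helper can then be used as the redirection target of the ssaha2 or smalt map call shown above.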
Mira can be used in many different ways: building assemblies from scratch, performing reassembly on existing projects, assembling sequences from closely related strains, assembling sequences against an existing backbone (mapping assembly), etc. Mira comes with a number of quick switches, i.e., switches that turn on parameter combinations which should be suited for most needs.
E.g.: mira --project=foobar --job=sanger --fasta --highlyrepetitive
The line above tells mira that our project has the general name foobar and that the sequences are to be loaded from FASTA files, the sequence input file being named foobar_in.sanger.fasta (and the sequence quality file, if available, foobar_in.sanger.fasta.qual). The reads come from Sanger technology and mira is prepared for the genome containing nasty repeats. The result files will be in a directory named foobar_results; statistics about the assembly will be available in the foobar_info directory, e.g., a summary of contig statistics in foobar_info/foobar_info_contigstats.txt. Notice that the --job= switch is missing some specifications; mira will automatically fill in the remaining defaults (i.e., denovo,genome,accurate in the example above).
E.g.: mira --project=foobar --job=mapping,accurate,sanger --fasta --highlyrepetitive
This is the same as the previous example, except that mira will perform a mapping assembly in 'accurate' quality of the sequences against one or more backbone sequences. mira will therefore additionally load the backbone sequence(s) from the file foobar_backbone_in.fasta (FASTA being the default type of backbone sequence to be loaded) and, if it exists, quality values for the backbone from foobar_backbone_in.fasta.qual.
E.g.: mira --project=foobar --job=mapping,accurate,sanger --fasta --highlyrepetitive -SB:bft=gbf
As above, except we have added an extensive switch ([-SB:bft]) to tell mira that the backbones are in a GenBank format file (GBF). MIRA will therefore load the backbone sequence(s) from the file foobar_backbone_in.gbf. Note that the GBF file can also contain multiple entries, i.e., it can be a GBFF file.
E.g.: mira --project=foobar --job=mapping,accurate,sanger --fastq --highlyrepetitive -SB:bft=gbf
As above, except we have changed the input type for all files from FASTA to FASTQ.
Multithreading is in its infancy: presently only the SKIM algorithm uses multiple threads. The number of threads for this stage can be set via the [-GE:not] parameter, e.g. -GE:not=4 to use 4 threads.
A simple GAP4 project will do nicely. Please take care of the following: you need experiment / FASTA / PHD files that have already been preprocessed, i.e., at least the sequencing vector should have been tagged (in EXP files) or masked out (FASTA or PHD files). It would be nice if some kind of not too lazy quality clipping had also been done for the EXP files; pregap4 should do this for you.
Step 1: Create a file of filenames (named mira_in.fofn) for the project you wish to assemble. The file of filenames should contain the newline-separated names of the EXP files and nothing else.
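One possible way to create the file of filenames is sketched below. The function name make_fofn is hypothetical, and the .exp suffix is an assumption: adjust the pattern to whatever your experiment files are actually called.

```shell
# Hypothetical helper (not part of MIRA): write the names of all EXP files
# found in a directory to mira_in.fofn in that directory, one name per line.
# Assumption: experiment files carry the suffix ".exp".
make_fofn() {
    dir=$1
    (
        cd "$dir" || exit 1
        for f in *.exp; do
            # Skip the unexpanded pattern if no file matches
            if [ -f "$f" ]; then
                echo "$f"
            fi
        done > mira_in.fofn
    )
}
```

Calling make_fofn . in the project directory lists only the matching files, so other files lying around in the directory do not end up in mira_in.fofn.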
Step 2: Execute the mira assembly, optionally using command line options or output redirection:
$ /path/to/the/mira/package/mira ... other options ...
or simply
$ mira ... other options ...
if MIRA is in a directory which is in your PATH. The result of the assembly will now be in the directory named mira_results, where you will find mira_out.caf, mira_out.html etc., or in gap4 direct assembly format in the mira_out.gap4da sub-directory.
Step 3a: (This is not recommended anymore) Change to the gap4da directory and start gap4:
$ cd mira_results/mira_out.gap4da
mira_results/mira_out.gap4da$ gap4
Choose the menu 'File->New' and enter a name for your new database (like 'demo'). Then choose the menu 'Assembly->Directed assembly'. Enter the text 'fofn' in the entry labelled 'Input readings from List or file name' and enter the text 'failures' into the entry labelled 'Save failures to List or file name'. Press "OK".
That's it.
Step 3b: (Recommended) As an alternative to step 3a, one can use the caf2gap converter (see below):
mira_results$ caf2gap -project demo -version 0 -ace mira_out.caf
mira_results$ gap4 DEMO.0
Out-of-the-box example. MIRA comes with a few really small toy projects to test usability on a given system. Go to the minidemo directory and follow the instructions given in the section for own projects above, but start with step 2. Optionally, you might want to start mira while redirecting the output to a file for later analysis.
It is sometimes desirable to reassemble a project that has already been edited, for example when hidden data in reads has been uncovered or when some repetitive bases have been tagged manually. The canonical way to do this is by using CAF files as data exchange format and the caf2gap and gap2caf converters available from the Sanger Centre (http://www.sanger.ac.uk/Software/formats/CAF/).
Warning: The project will be completely reassembled. Contig joins or breaks that have been made in the GAP4 database will be lost, and you will get an entirely new assembly with what mira determines to be the best assembly.
Step 1: Convert your GAP4 project with the gap2caf tool. Assuming that the assembly is in the GAP4 database CURRENT.0, run:
$ gap2caf -project CURRENT -version 0 -ace > newstart_in.caf
The name "newstart" will be the project name of the new assembly project.
Step 2: Start mira with the -caf option and tell it the name of your new reassembly project:
$ mira -caf=newstart
(and other options like --job etc. at will.)
Step 3: Convert the resulting CAF file newstart_assembly/newstart_d_results/newstart_out.caf to a gap4 database format as explained above and start gap4 with the new database:
$ cd newstart_assembly/newstart_d_results
newstart_assembly/newstart_d_results$ caf2gap -project reassembled -version 0 -ace newstart_out.caf
newstart_assembly/newstart_d_results$ gap4 REASSEMBLED.0
One useful feature of mira is the ability to assemble against already existing reference sequences or contigs (also called a mapping assembly). The parameters that control the behaviour of the assembly in these cases are in the [-STRAIN/BACKBONE] section of the parameters.
Please have a look at the example in the minidemo/bbdemo2 directory, which maps sequences from C. jejuni RM1221 against (parts of) the genome of C. jejuni NCTC 11168.
There are a few things to consider when using backbone sequences:
Backbone sequences can be as long as needed! They are not subject to normal read length constraints of a maximum of 10k bases. That is, if one wants to load one or several entire chromosomes of a bacterium or lower eukaryote as backbone sequence(s), this is just fine.
Backbone sequences can be single sequences such as those provided by, e.g., FASTA, FASTQ or GenBank files. But backbone sequences can also be whole assemblies when they are provided as, e.g., CAF format. This opens the possibility to perform semi-hybrid assemblies by first assembling reads from one sequencing technology de-novo (e.g. 454) and then mapping reads from another sequencing technology (e.g. Solexa) to the whole 454 alignment instead of mapping them to the 454 consensus.
A semi-hybrid assembly will therefore contain, like a hybrid assembly, the reads of both sequencing technologies.
Backbone sequences will not be reversed! They will always appear in forward direction in the output of the assembly. Please note: if the backbone sequence consists of a CAF file that contains contigs which contain reversed reads, then the contigs themselves will be in forward direction. But the reads they contain that are in reverse complement direction will of course also stay in reverse complement direction.
Backbone sequences will not be assembled together! That is, if a sequence of the backbones has a perfect overlap with another backbone sequence, they will still not be merged.
Reads are assembled to backbones in a first come, first served scattering strategy.
Suppose you have two identical backbones and one read which would match both, then the read would be mapped to the first backbone. If you had two (almost) identical reads, the first read would go to the first backbone, the second read to the second backbone. With three almost identical reads, the first backbone would get two reads, the second backbone one read.
Only in backbones loaded from CAF files: contigs made out of single reads (singlets) lose their status as backbones and will be returned to the normal read pool for the assembly process. That is, these sequences will be assembled to other backbones or with each other.
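The first come, first served scattering of reads over identical backbones, described a few points above, can be pictured with a small loop. This is an illustration only: the function scatter_reads and the read and backbone names are made up, and the simple round-robin used here is an assumption that merely reproduces the outcomes stated in the text, not MIRA's actual placement code.

```shell
# Illustration only (not MIRA's implementation): hand out reads that match
# several identical backbones in turn, one backbone after the other.
# Usage: scatter_reads NUMBER_OF_BACKBONES READ...
scatter_reads() {
    n_backbones=$1
    shift
    i=0
    for r in "$@"; do
        echo "$r -> backbone$(( i % n_backbones + 1 ))"
        i=$(( i + 1 ))
    done
}

# Three almost identical reads, two identical backbones
scatter_reads 2 readA readB readC
```

With three reads and two backbones, the first backbone receives two reads and the second backbone one, as in the description.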
Examples for using backbone sequences:
Example 1: assume you have a genome of an existing organism. From that, a mutant has been made by mutagenesis and you are skimming the genome in shotgun mode for mutations. You would generate for this a straindata file that gives the name of the mutant strain to the newly sequenced reads and simply assemble those against your existing genome, using the following parameters:
-SB:lsd=yes:lb=yes:bsn=nameOriginalStrain:bft=caf|fasta|gbf
When loading backbones from CAF, the qualities of the consensus bases will be calculated by mira according to normal consensus computing rules. When loading backbones from FASTA or GBF, one can set the expected overall quality of the sequences (e.g. 1 error in 1000 bases = quality of 30) with [-SB:bbq=30]. It is recommended to have the backbone quality at least as high as the [-CO:mgqrt] value, so that mira can automatically detect and report SNPs.
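The correspondence between "1 error in 1000 bases" and a quality of 30 matches the usual Phred-style scale. Assuming that scale, with p being the expected error probability per base, the value works out as:

```latex
Q = -10 \log_{10} p, \qquad
p = \frac{1}{1000} \;\Rightarrow\; Q = -10 \log_{10}\!\left(10^{-3}\right) = 30
```

By the same formula, 1 error in 100 bases would correspond to a quality of 20 and 1 error in 10000 bases to a quality of 40.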
Example 2: suppose that you are in the process of performing a shotgun sequencing and you want to determine the moment when you have enough reads. One could make a complete assembly each day when new sequences arrive. However, starting with genomes the size of a lower eukaryote, this may become prohibitive from the computational point of view. A quick and efficient way to resolve this problem is to use the CAF file of the previous assembly as backbone and simply add the new reads to the pool. The number of singlets remaining after the assembly versus the total number of reads of the project is a good measure for the coverage of the project.
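The singlet ratio mentioned in example 2 is simple arithmetic. As a minimal sketch, assuming you have already counted the singlets and the total number of reads yourself (the function name singlet_percent is hypothetical), it could be tracked from run to run like this:

```shell
# Hypothetical helper (not part of MIRA): percentage of reads that remained
# singlets after the assembly. Both counts are supplied by the user; the
# integer arithmetic truncates the result.
singlet_percent() {
    singlets=$1
    total=$2
    if [ "$total" -le 0 ]; then
        echo "total number of reads must be positive" >&2
        return 1
    fi
    echo $(( 100 * singlets / total ))
}

# Example: 2000 singlets out of 20000 reads
singlet_percent 2000 20000
```

A percentage that keeps falling as new reads are added suggests that the new reads increasingly land in existing contigs.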
Example 3: in EST assembly with miraSearchESTSNPs, existing cDNA sequences can also be useful when added to the project during step 3 (in the file step3_in.par). They will provide a framework against which mRNA-contigs built in previous steps will be assembled, allowing for a fast evaluation of the results. Additionally, they provide a direction for the assembled sequences so that one does not need to invert single contigs by hand afterwards.
(To be expanded)
This can have two causes:
if you work with a 32 bit executable of caf2gap, it might very well be that the converter needs more memory than can be handled by 32 bit. Only solution: switch to a 64 bit executable of caf2gap.
you compiled caf2gap with a caftools version prior to 2.0.1 and then caf2gap throws segmentation errors. Simply grab the newest version of the caftools (at least 2.0.2) at ftp://ftp.sanger.ac.uk/pub/PRODUCTION_SOFTWARE/src/ and compile the whole package. caf2gap will be contained therein.
caf2gap currently (as of version 2.0.2) has a bug that turns around all features in reverse direction during the conversion from CAF to a gap4 project. There is a fix available; please contact me for further information (until I find time to describe it here).