“The manual only makes sense after you learn the program.”
--Solomon Short
mira
[--project=<name>]
[--cwd=<directory>]
[--job=arguments]
[--fasta[=<filename>] | --fastq[=<filename>] | --caf[=<filename>] | --phd[=<filename>]]
[--notraceinfo]
[--noclipping[=...]]
[--highlyrepetitive]
[--lowqualitydata]
[--highqualitydata]
[--params=<filename>]
[-GENERAL:arguments]
[-STRAIN/BACKBONE:arguments]
[-ASSEMBLY:arguments]
[-DATAPROCESSING:arguments]
[-CLIPPING:arguments]
[-SKIM:arguments]
[-ALIGN:arguments]
[-CONTIG:arguments]
[-EDIT:arguments]
[-MISC:arguments]
[-DIRECTORY:arguments]
[-FILENAME:arguments]
[-OUTPUT:arguments]
[COMMON_SETTINGS | SANGER_SETTINGS | 454_SETTINGS | IONTOR_SETTINGS | PACBIO_SETTINGS | SOLEXA_SETTINGS]
For an easy introduction on how to use mira, a number of tutorials with step-by-step instructions are available:
mira_usage for basic Sanger assembly
mira_454 for basic 454 assembly
mira_iontor for basic Ion Torrent assembly
mira_pacbio for basic assembly of sequences from Pacific Biosciences
mira_solexadev for basic mapping assembly of Solexa data
mira_est for some advice concerning assembly of EST sequences (and miraSearchESTSNPs)
mira_hard for some notes on how to assemble 'hard' data sets: EST data sets or genome projects for eukaryotes, but some prokaryotes also qualify for this
mira_faq with some frequently asked questions
To use mira itself, one doesn't need very much:
Sequence data in EXP, CAF, PHD, FASTA or FASTQ format (ideally preprocessed)
Optionally: ancillary information in NCBI traceinfo XML format; ancillary information about strains in tab delimited format, vector screen information generated with ssaha2 or smalt.
Some memory and disk space. Actually lots of both if you are venturing into 454 or Solexa.
mira has three basic working modes: genome, EST or EST-reconstruction-and-SNP-detection. From version 2.4 on, there is only one executable, which supports all modes. The name with which this executable is called defines the working mode:
mira for assembly of genomic data as well as assembly of EST data from one or multiple strains / organisms
and
miraSearchESTSNPs for assembly of EST data from different strains (or organisms) and SNP detection within this assembly. This is the former miraEST program which was renamed as many people got confused regarding whether to use mira in est mode or miraEST.
Note that miraSearchESTSNPs is usually realised as a link to the mira executable; the executable decides which module to start by the name it was called with.
Parameters can be given on the command line or loaded via parameter files.
mira knows two basic parameter types: quick switches and extensive switches.
quick switches, also dubbed DWIM switches (for 'Do-What-I-Mean'), are easy to use-and-combine switches activating parameter collections for predefined tasks that will suit most people's needs.
extensive switches offer a way to set just about any parameter to configure mira for any kind of special need. While the format of extensive switches might look a little bit strange, it is borrowed from the SGI C compiler options and allows both compact command lines and readable and / or script-generated parameter files.
Due to the introduction of new sequencing technologies like 454, Solexa and ABI SOLiD, the extensive switches had to be split into two groups:
technology independent switches which control general behaviour of MIRA like, e.g., the number of assembly passes or file names etc.
technology dependent switches which control behaviour of algorithms where the sequencing technology plays a role. An example of this would be the minimum length of a read (like 200 for Sanger reads and 120 for 454 FLX reads).
More on this a bit further down in this documentation.
As an example, a typical call of mira using quick switches and some tweaking with extensive switches on the command line could look like this:
mira --job=denovo,genome,draft,sanger --fasta
SANGER_SETTINGS
-ALIGN:min_relative_score=70
-GENERAL:use_template_information=yes
-GENERAL:templateinsertsizeminimum=500:templateinsertsizemaximum=2500
or in short form
mira --job=denovo,genome,draft,sanger --fasta
SANGER_SETTINGS
-AL:mrs=70
-GE:uti=yes:tismin=500:tismax=2500
Please note that it is also perfectly legal to decompose the switches so that they can be used more easily in scripted environments (notice the multiple -GE in the following example):
mira --job=denovo,genome,draft,sanger --fasta
SANGER_SETTINGS
-AL:mrs=70
-GE:uti=yes
-GE:tismin=500
-GE:tismax=2500
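For scripted environments, the decomposed form above can be generated programmatically. The following is a minimal sketch, assuming a hypothetical helper `build_mira_command` (not part of MIRA) that only assembles the argument list from the switches shown in the examples; it does not run mira.

```python
import shlex


def build_mira_command(job, extensive, input_switch="--fasta",
                       section="SANGER_SETTINGS"):
    """Assemble a mira argument list with one extensive switch per parameter.

    'extensive' is a list of (group, [(name, value), ...]) pairs, e.g.
    ("GE", [("uti", "yes")]). The helper itself is hypothetical; only the
    switch names are taken from the examples in the text.
    """
    argv = ["mira", "--job=" + ",".join(job), input_switch, section]
    for group, params in extensive:
        for name, value in params:
            # Decomposed form: repeat the group prefix for every parameter.
            argv.append(f"-{group}:{name}={value}")
    return argv


argv = build_mira_command(
    job=["denovo", "genome", "draft", "sanger"],
    extensive=[
        ("AL", [("mrs", 70)]),
        ("GE", [("uti", "yes"), ("tismin", 500), ("tismax", 2500)]),
    ],
)
command_line = shlex.join(argv)
print(command_line)
```

The resulting list can be handed to a process launcher as is; keeping one parameter per argument makes it easy to add or drop single settings in a script.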
These switches are 'Do-What-I-Mean' parameter collections for predefined tasks which should suit most people's needs. You might still need a few of the extensive switches, but not too many anymore.
Important note 1: For de-novo assembly of genomes, these switches are optimised for 'decent' coverages that are commonly seen to get you something useful, i.e., ≥ 7x for Sanger, ≥ 18x for 454 FLX or Titanium, ≥ 25x for 454 GS20 and ≥ 30x for Solexa. Should you venture into lower coverage or extremely high coverage (say, ≥ 60x for 454), you will need to adapt a few parameters via extensive switches.
Important note 2: For some switches, the order of appearance in the command line (or parameter file) is important. This is because the quick switches are realised internally as a collection of extensive switches that will overwrite any previously manually set extensive switch. It is generally a good idea to place switches in the order as described in this documentation, that is: first the order dependent quick switches, then other quick switches, then all the other extensive switches.
Warning: E.g. always write
The main one-stop switch for most assemblies. You can choose between two different assembly methods (denovo or mapping), two different assembly types (genome or est), two different quality grades (draft or accurate) and mix different sequencing technologies (sanger, 454, iontor, solexa). This switch is explained in more detail in the subsection "The --job= switch in detail".
A modifier switch for genome data that is deemed to be highly repetitive. The assemblies will run slower due to more iterative cycles that give mira a chance to resolve nasty repeats.
Switches off clipping options for given sequencing technologies. Technologies can be sanger, 454, iontor, solexa or solid. Multiple entries separated by comma.
Note that [-CL:pec] and the chimera clipping [-CL:ascdc] are not switched off by this parameter and should be switched off separately.
Examples:
Switch off 454 and Solexa clipping (but keep Sanger clipping): --noclipping=454,solexa
Switch off all: --noclipping or --noclipping=all
Switches off loading TRACEINFO ancillary data in XML files for all technologies. Place it after [--fasta] and/or [--job=] quick switches.
Loads parameters from the filename given. Allows a maximum of 10 levels of recursion, i.e. a --params option appearing within a file that loads other parameter files (though I cannot think of useful applications with more than 3 levels).
When encountered during parameter parsing, MIRA will change the working directory immediately to the directory given and read and write files there.
Therefore, a call like mira -DI:cwd=/somedir --params=myparameters.txt will be enough to let MIRA change to the directory /somedir, read further parameters from a text file myparameters.txt (which should be present there) and have all the input and output of the assembly occur in directory /somedir.
Sets parameters suited for loading sequences from FASTA files. The version with =<filename> will also set the input file to the given filename.
Sets parameters suited for loading sequences from PHD files. The version with =<filename> will also set the input file to the given filename.
Sets parameters suited for loading sequences from CAF files. The version with =<filename> will also set the input file to the given filename.
The following switches can be placed anywhere on the command line without interfering with other switches:
Default is mira. Defines the project name for this assembly. The project name automatically influences the name of input and output files / directories. E.g. in the default setting, the file names for the output of the assembly in FASTA format would be mira_out.fasta and mira_out.fasta.qual. Setting the project name to "MyProject" would generate MyProject_out.fasta and MyProject_out.fasta.qual. See also -FILENAME: and -DIRECTORY: for a list of names that are influenced.
Default is mira. Works like [--project=<name>], but takes effect only on input files.
Default is mira. Works like [--project=<name>], but takes effect only on output files.
Note: A double dash (e.g. --params) may also be used instead of a single one in front of the quick switches.
Examples for using these switches can be found in the documentation files describing mira usage.
This is the main one-stop switch for most assemblies. You need to make your choice mainly in four steps and in the end concatenate your choices to the [--job=] switch:
are you building an assembly from scratch (choose: denovo) or are you mapping reads to an existing backbone sequence (choose: mapping)? Pick one. Leaving this out automatically chooses denovo as default.
are the data you are assembling forming a larger contiguous sequence (choose: genome) or are you assembling small fragments like in EST or mRNA libraries (choose: est)? Pick one. Leaving this out automatically chooses genome as default.
do you want a quick and dirty assembly for first insights (choose: draft) or an assembly that should be able to tackle even most nasty cases (choose: accurate)? Pick one. Leaving this out automatically chooses accurate as default.
finally, which sequencing technologies have created your reads: sanger, 454, iontor, solexa or solid? You can pick multiple. Leaving this out automatically chooses only sanger as default.
Once you're done with your choices, concatenate everything with commas and you're done. E.g.: '--job=denovo,genome,draft,sanger,iontor' will give you a de-novo assembly of a genome in draft quality using a hybrid assembly method with Sanger and Ion Torrent reads.
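The four choices and their defaults can be summarised in a few lines of code. This is a minimal sketch, assuming a hypothetical helper `compose_job` (not part of MIRA); the allowed values and defaults are the ones listed in the four steps above.

```python
METHODS = ("denovo", "mapping")
TYPES = ("genome", "est")
QUALITIES = ("draft", "accurate")
TECHNOLOGIES = ("sanger", "454", "iontor", "solexa", "solid")


def compose_job(method="denovo", kind="genome", quality="accurate",
                technologies=("sanger",)):
    """Build a --job= switch; defaults mirror what is chosen when a part is left out."""
    if method not in METHODS:
        raise ValueError(f"unknown assembly method: {method}")
    if kind not in TYPES:
        raise ValueError(f"unknown assembly type: {kind}")
    if quality not in QUALITIES:
        raise ValueError(f"unknown quality grade: {quality}")
    for tech in technologies:
        if tech not in TECHNOLOGIES:
            raise ValueError(f"unknown sequencing technology: {tech}")
    # Exactly one method, type and quality; one or more technologies.
    return "--job=" + ",".join([method, kind, quality, *technologies])


print(compose_job(quality="draft", technologies=("sanger", "iontor")))
```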
Extensive switches open up the full panoply of possibilities the MIRA assembler offers. This ranges from fine-tuning assemblies with the quick switches from above to setting parameters in a way so that mira is suited also for very special assembly cases.
Important note: As soon as you use a quick switch (especially --job), the 'default' settings given for extensive switches in the manual below probably do not apply anymore as the quick switch tweaks a lot of extensive switches internally.
With the introduction of new sequencing technologies, mira also had to be able to set values that allow technology specific behaviour of algorithms. One simple example for this could be the minimum length a read must have to be used in the assembly. For Sanger sequences, setting this value to 150 (meaning a read should have at least 150 unclipped bases) would be a very valid albeit conservative choice. For 454 reads and especially Solexa and ABI SOLiD reads, however, this value would be ridiculously high.
To allow very fine grained behaviour, especially in hybrid assemblies, and to prevent the explosion of parameter names, mira uses technology mode switching in the parameter files or on the command line.
Example: assume the following basic command line
mira -fasta -job=denovo,genome,draft,454,solexa
Here is, as an example, part of the output of used parameters that mira will show:
...
Assembly options (-AS):
    Number of passes (nop)                   : 1
    Skim each pass (sep)                     : yes
    Maximum number of RMB break loops (rbl)  : 1
    Spoiler detection (sd)                   : no
        Last pass only (sdlpo)               : yes
    Minimum read length (mrl)                :  [san] 80
                                                [454] 40
                                                [sxa] 20
    Base default quality (bdq)               :  [san] 10
                                                [454] 10
                                                [sxa] 10
...
You can see the two different kinds of settings that mira uses: common settings (like [-AS:nop]) and technology dependent settings (like [-AS:mrl]), where for each sequencing technology used in the project, the setting can be different.
How would one set a minimum read length of 80 and a base default quality of 10 for 454 reads, but for Solexa reads a minimum read length of 30 with a base default quality of 15? The answer:
mira -job=denovo,genome,draft,454,solexa -fasta
454_SETTINGS -AS:mrl=80:bdq=10 SOLEXA_SETTINGS -AS:mrl=30:bdq=15
Notice the ..._SETTINGS sections in the command line (or parameter file): each of these tells mira that all the following parameters, until another ..._SETTINGS switch appears, are to be set specifically for the said technology.
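The per-technology part of such a command line can also be produced from a small data structure. The sketch below assumes a hypothetical helper `settings_sections` (not part of MIRA) and uses the compact form in which parameters of one group are joined with colons.

```python
def settings_sections(per_technology):
    """Turn {section: {group: {name: value}}} into a flat argument list.

    Each section name (e.g. '454_SETTINGS') is emitted first, followed by
    the switches that should apply to that technology only.
    """
    parts = []
    for section, groups in per_technology.items():
        parts.append(section)
        for group, params in groups.items():
            joined = ":".join(f"{name}={value}" for name, value in params.items())
            parts.append(f"-{group}:{joined}")
    return parts


args = settings_sections({
    "454_SETTINGS": {"AS": {"mrl": 80, "bdq": 10}},
    "SOLEXA_SETTINGS": {"AS": {"mrl": 30, "bdq": 15}},
})
print(" ".join(args))
```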
Besides COMMON_SETTINGS there are currently 6 technology settings available:
SANGER_SETTINGS
454_SETTINGS
IONTOR_SETTINGS
PACBIO_SETTINGS
SOLEXA_SETTINGS
SOLID_SETTINGS
Some settings of mira influence global behaviour and are not related to a specific sequencing technology; these must be set in the COMMON_SETTINGS environment. For example, it would not make sense to try to set a different number of assembly passes for each technology like in
mira -job=denovo,genome,draft,454,solexa -fasta
454_SETTINGS -AS:nop=4 SOLEXA_SETTINGS -AS:nop=3
mira will complain about cases like these. Simply set those common settings in an area prefixed with the COMMON_SETTINGS switch like in
mira -job=denovo,genome,draft,454,solexa -fasta
COMMON_SETTINGS -AS:nop=4 454_SETTINGS ... SOLEXA_SETTINGS ...
Since MIRA 3rc3, the parameter parser will help you by checking whether parameters are correctly defined as COMMON_SETTINGS or technology dependent setting.
General options control the type of assembly to be performed and other switches not belonging anywhere else.
string
]
Same as the quick switch [-project]. Defines the name of your project and influences the naming of your input and output files.
1 ≤ integer ≤ 256
]
Default is 2. Master switch to set the number of threads used in different parts of mira.
Note 1: currently only the SKIM algorithm uses multiple threads, other parts will follow.
Note 2: Although the main data structures are shared between the threads, there's some additional memory needed for each thread.
Note 3: when running the SKIM in parallel threads, MIRA can give different results when started with the same data and same arguments. While the effect could be averted for SKIM, the memory cost for doing so would be an additional 50% for one of the large tables, so this has not been implemented at the moment. Besides, once the Smith-Waterman alignments also run in parallel, this could not easily be avoided anyway.
on|yes|1, off|no|0
]
Default is Yes. Defines whether mira tries to optimise the run time of certain algorithms by trading memory for speed, increasing or reducing the size of some internal tables as memory permits.
Note 1: This functionality currently relies on the /proc file system giving information on the system memory ("MemTotal" in /proc/meminfo) and the memory usage of the current process ("VmSize" in /proc/self/status). If this is not available, the functionality is switched off.
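To illustrate what kind of information is read from the /proc file system, here is a minimal sketch in Python. The function names `read_kib` and `memory_snapshot` are hypothetical and this is not MIRA's own code; it only shows the two fields named above and the fallback when they are not available.

```python
import re


def read_kib(text, field):
    """Return the value of 'field' (in kB) from /proc-style text, or None if absent."""
    pattern = rf"^{re.escape(field)}:\s+(\d+)\s+kB"
    match = re.search(pattern, text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def memory_snapshot(meminfo_path="/proc/meminfo",
                    status_path="/proc/self/status"):
    """Return (MemTotal, VmSize) in kB, or None when /proc is not usable."""
    try:
        with open(meminfo_path) as handle:
            total = read_kib(handle.read(), "MemTotal")
        with open(status_path) as handle:
            used = read_kib(handle.read(), "VmSize")
    except OSError:
        # No /proc file system: behave as if the feature were switched off.
        return None
    if total is None or used is None:
        return None
    return total, used


print(memory_snapshot())
```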
Note 2: The automatic memory management can only work if there actually is unused system memory. It's not a wonder switch which reduces memory consumption. In tight memory situations, memory management has no effect and the algorithms fall back to minimum table sizes. This means that the effective size in memory can grow larger than given in the memory management parameters, but then MIRA will try to keep the additional memory requirements to a minimum.
0 ≤ integer
]
Default is 0. If automatic memory management is used (see above), this number is the size in gigabytes that the MIRA process will use as maximum target size when looking for space/time trade-offs. A value of 0 means that MIRA does not try to keep a fixed upper limit.
Note: when in competition with [-GE:kpmf] (see below), the smaller of both sizes is taken as target. Example: if your machine has 64 GiB but you limit the use to 32 GiB, then the MIRA process will try to stay within these 32 GiB.
0 ≤ integer
]
Default is 10. If automatic memory management is used (see above), this number works a bit like [-GE:mps] but the other way round: it tries to keep x percent of the memory free.
Note: when in competition with [-GE:mps] (see above), the argument leaving the most memory free is taken as target. Example: if your machine has 64 GiB and you limit the use to 42 GiB via [-GE:mps] but have a [-GE:kpmf] of 50, then the MIRA process will try to stay within 64-(64*50%)=32 GiB.
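The interplay of the two limits is simple arithmetic. The following sketch, using a hypothetical function `memory_target_gib`, reproduces the examples from the notes; treating the percentage as a plain fraction of total memory is an assumption of this sketch, not a statement about MIRA's internals.

```python
def memory_target_gib(total_gib, mps=0, kpmf=10):
    """Return the memory target in GiB implied by -GE:mps and -GE:kpmf.

    mps:  maximum process size in GiB, 0 meaning 'no fixed upper limit'.
    kpmf: percentage of total memory to keep free.
    The limit that leaves the most memory free (the smaller one) wins.
    """
    limits = [total_gib * (1 - kpmf / 100)]
    if mps > 0:
        limits.append(mps)
    return min(limits)


# Example from the text: 64 GiB machine, mps=42, kpmf=50
print(memory_target_gib(64, mps=42, kpmf=50))
```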
1 ≤ integer ≤ 4
]
Default is 1. Controls the starting step of the SNP search in EST pipeline and is therefore only useful in miraSearchESTSNPs.
EST assembly is a three step process, each with different settings to the assembly engine, with the result of each step being saved to disk. If results of previous steps are present in a directory, one can easily "play around" with different settings for subsequent steps by reusing the results of the previous steps and directly starting with step two or three.
on|yes|1, off|no|0
]
Default is Yes. Two reads sequenced from the same clone template form a read pair with a known minimum and maximum distance. This feature will definitely help for contigs containing lots of repeats. Set this to 'yes' if your data contains information on insert sizes (e.g. in paired-end sequencing).
Information on insert sizes can be given via the SI tag in EXP files (for each read pair individually), via insert_size and insert_stdev elements of NCBI TRACEINFO XML files or for the whole project using [-GE:tismin] and [-GE:tismax] (see below).
Additional information to set the orientation of the read-pairs can be given via [-GE:tpbd].
integer
]
Default is -1. The default value for the minimum template size for reads that have no template size in ancillary data. If -1 is used as value, then no default value is given and reads without ancillary data giving this number will behave as if they had no template.
integer
]
Default is -1. The default value for the maximum template size for reads that have no template size in ancillary data. If -1 is used as value, then no default value is given and reads without ancillary data giving this number will behave as if they had no template.
-1 or 1
]
Default is -1 for all sequencing technologies.
This value tells MIRA how read-pairs of a template must be oriented in a contig to be valid. A value of "-1" means the orientation must be 5'-3' to 3'-5', a value of "1" means 5'-3' to 5'-3'.
Set this to "1" if you assemble paired-end 454 data downloaded from the Short Read Archives (SRAs, at the NCBI and EMBL). Set this also to "1" for Solexa data where the paired-end sequencing protocol used creates 5'-3' to 5'-3' pairs.
Note: Although with Solexa it is possible to build libraries in both directions, with MIRA it is currently not possible to mix within the same sequencing technology paired-end reads which need "-1" as direction with mate-pair reads which have "1" as direction. This will be worked on if the need arises.
on|yes|1, off|no|0
]
Default is yes. Controls whether date and time are printed out during the assembly. Suppressing it is not useful in normal operation, only when debugging or benchmarking.
Here one defines what type of reads to load.
on|yes|1, off|no|0
]
Default is No. Defines whether to load data generated by a given technology.
fofnexp, fasta, fastq, caf, phd,
fofnphd
]
Default is fasta. Takes effect only when [-LR:lsd] is 'yes'.
Defines whether to load sequences for assembly from FASTA sequences (<projectname>_in.fasta) and their qualities (<projectname>_in.fasta.qual), from a FASTQ file (<projectname>_in.fastq), from EXP files listed in a file of filenames (<projectname>_in.fofn), from a PHD file (<projectname>_in.phd) or from a CAF file (<projectname>_in.caf), and assemble or possibly reassemble them.
Note 1: Only Sanger supports all file types. 454, Ion Torrent and Solexa support only FASTA and FASTQ.
Note 2: fofnphd is currently not available.
none, SCF
]
Default is SCF. Takes effect only when [-LR:lsd] is 'yes' and for Sanger reads.
Defines the source format for reading qualities from external sources. Normally takes effect only when these are not present in the format of the load_job project (EXP and FASTA can have them, CAF and PHD must have them).
on|yes|1, off|no|0
]
Default is no. Takes effect only when [-LR:lsd] is 'yes', for Sanger reads, and when load_job is fofnexp.
Defines whether or not the qualities from the external source override the possibly loaded qualities from the load_job project. This might be of use in case some post-processing software fiddles around with the quality values of the input file but one wants to have the original ones.
on|yes|1, off|no|0
]
Default is yes. Takes effect only when [-LR:lsd] is 'yes' and for Sanger reads.
Defines whether a read is excluded from the assembly when there is a major mismatch between the external quality source and the sequence (e.g.: the base sequence read from a SCF file does not match the originally read base sequence). If not, the read will use the qualities it had before trying to load the external qualities (either default qualities or the ones loaded from the original source).
on|yes|1, off|no|0
]
Default is yes. When set to yes, MIRA will stop the assembly if there is no quality file for a given sequence file. E.g., if the FASTA quality file is missing when loading from FASTA.
sanger, tigr, fr, stlouis, solexa
]
Default is sanger for Sanger sequencing data, fr for 454 and Ion Torrent, and solexa for Solexa. Defines the read naming scheme for read suffixes. These suffixes can be used by mira to deduce a template name if none is given in ancillary data.
Currently, the Sanger centre, TIGR, simple forward / reverse, St. Louis and Solexa/Illumina naming schemes are supported out of the box.
How to choose: please read the documentation available at the different centres or ask your sequence provider. In a nutshell (and probably over-simplified):
"somename.[pqsfrw][12][bckdeflmnpt][a|b|c|..." (e.g. U13a08f10.p1ca), but the length of the postfix must be at least 4 characters, i.e., ".p" alone will not be recognised.
Usually, ".p" + 3 characters or "f" + 3 characters are used for forwards reads, while reverse complement reads take either ".q" or ".r" (+ 3 characters in both cases).
"somenameTF*|TR*|TA*" (e.g. GCPBN02TF or GCPDL68TABRPT103A58B),
Forward reads take "TF*", reverse reads "TR*".
"somename.[fr]*" (e.g. E0K6C4E01DIGEW.f or E0K6C4E01BNDXN.r2nd),
".f*" for forward, ".r*" for reverse.
"somename.[sfrxzyingtpedca]*"
Even simpler than the forward/reverse scheme, it allows only for one or two reads per template: "somename/[12]"
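How a template name can be deduced from a read name is easiest to see for the two simplest patterns, "somename.[fr]*" and "somename/[12]". The sketch below is a simplified illustration with hypothetical names (`split_read_name`, the scheme labels "fr" and "solexa" taken from the allowed values of this switch); it is not MIRA's parser and ignores the more involved Sanger, TIGR and St. Louis patterns.

```python
import re

# "somename.[fr]*": a dot, then f or r, then an optional free-form suffix
FR_PATTERN = re.compile(r"^(?P<template>.+)\.(?P<tag>[fr])[^.]*$")
# "somename/[12]": a slash followed by the read number
SOLEXA_PATTERN = re.compile(r"^(?P<template>.+)/(?P<tag>[12])$")


def split_read_name(name, scheme):
    """Return (template, tag) for a read name, or None if it does not fit."""
    if scheme == "fr":
        pattern = FR_PATTERN
    elif scheme == "solexa":
        pattern = SOLEXA_PATTERN
    else:
        raise ValueError(f"scheme not covered by this sketch: {scheme}")
    match = pattern.match(name)
    if match is None:
        return None
    return match["template"], match["tag"]


print(split_read_name("E0K6C4E01DIGEW.f", "fr"))
print(split_read_name("E0K6C4E01BNDXN.r2nd", "fr"))
```

Two reads that yield the same template name under the chosen scheme would then be treated as belonging to the same template.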
on|yes|1, off|no|0
]
Default is no. This switch applies only to sequences from older Illumina / Solexa sequencing technology when loading from FASTA! Defines whether the FASTA quality file contains Solexa scores (which can also have negative values) instead of quality values. If set to yes, mira will automatically convert the Solexa scores to phred style quality values.
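For reference, the commonly used relation between a Solexa score and a phred quality is Q_phred = 10 * log10(10^(Q_solexa/10) + 1). The sketch below applies this formula; the function name is hypothetical, and whether and how MIRA rounds the result is not stated here, so the value is left unrounded.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa score (may be negative) to a phred-style quality."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


for score in (-5, 0, 10, 40):
    print(score, round(solexa_to_phred(score), 2))
```

The two scales differ noticeably only for low values; for high scores they are practically identical.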
integer
]
Default is 0. This switch applies only for sequences loaded from FASTQ format!
Defines the quality offset used to convert characters into quality values. Usually, 33 is used for FASTQ in Sanger style, Solexa 1.0 format uses 59 (I think) and newer Solexa 1.3 format uses 64.
The default value of 0 switches on routines that try to guess the correct value from the data present in the FASTQ (which they do when the data contains at least one read with at least one base with quality between 0 and 4).
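The role of the offset is easy to see when decoding a quality string by hand. This is a minimal sketch with a hypothetical function `decode_qualities`; it shows only the character-to-value conversion, not the guessing routines.

```python
def decode_qualities(quality_string, offset):
    """Convert FASTQ quality characters to integer values using the given offset."""
    return [ord(char) - offset for char in quality_string]


# The same qualities (40, 40, 20) encoded with offset 33 and with offset 64
print(decode_qualities("II5", 33))
print(decode_qualities("hhT", 64))
```

Decoding a string with the wrong offset shifts every value by the difference between the two offsets, which is why the correct setting matters.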
on|yes|1, off|no|0
]
Default is no. Some file formats above (FASTA, PHD or even CAF and EXP) possibly do not contain all the info necessary or useful for each read of an assembly. Should additional information -- like clipping positions etc. -- be available in an XML trace info file in NCBI format (see File formats), then set this option to yes and it will be merged to all the data loaded, be it for Sanger, 454, Ion Torrent, Solexa or SOLiD technology. See also -FILENAME: for the name of the XML file to load.
Please note: quality clippings given here will override quality clippings loaded earlier (e.g. in EXP files) or performed by mira. Minimum clippings will still be made by the program, though.
on|yes|1, off|no|0
]
Default is no. If set to yes, the project will not be assembled and no assembly output files will be produced. Instead, the project files will only be loaded. This switch is useful for checking consistency of input files.
General options for controlling the assembly.
integer
]
Default is dependent on the sequencing technology and assembly quality level. Defines how many iterations of the whole assembly process are done.
As a special use case, a value of 0 will let MIRA just run the following tasks: loading and clipping of reads as well as calculating hash frequencies and read repeat information. The resulting reads can then be found as MAF file in the checkpoint directory; the read repeat information in the info directory.
Early termination: if the number of passes was chosen too high, one can simply create a file projectname_assembly/projectname_d_chkpt/terminate. At the beginning of a new pass, MIRA checks for the existence of that file and, if it finds it, acknowledges by renaming it to projectname_assembly/projectname_d_chkpt/terminate_acknowledged and then runs 2 more passes (with special "last pass routines") before finishing the assembly.
on|yes|1, off|no|0
]
Default is dependent on the sequencing technology and assembly quality level. Defines whether the skim algorithm (and with it also the recalculation of Smith-Waterman alignments) is called in-between each main pass. If set to no, skimming is done only when needed by the workflow: either when read extensions are searched for ([-DP:ure]) or when possible vector leftovers are to be clipped ([-CL:pvc]).
Setting this option to yes is highly recommended, setting it to no only for quick and dirty assemblies.
integer > 0
]
Default is dependent on the sequencing technology and assembly quality level. Defines the maximum number of times a contig can be rebuilt during a main assembly pass ([-AS:nop]) if misassemblies due to possible repeats are found.
integer
]
Default is 0. Defines how many contigs are maximally built in each pass. A value of 0 stands for 'unlimited'. Values >0 can be used for special use cases like test assemblies etc.
If in doubt, do not touch this parameter.
on|yes|1, off|no|0
]
Default is currently yes. Tells mira to use coverage information accumulated over time to more accurately pinpoint reads that are in repetitive regions.
float > 1.0
]
Default is 2.0 for all sequencing technologies in most assembly cases. This option says this: if a read has ever been aligned at positions where the total coverage of all reads of the same sequencing technology attained the average coverage times [-AS:ardct] (over a length of [-AS:ardml], see below), then this read is considered to be repetitive.
integer > 1
]
Default is dependent on the sequencing technology, currently 400 for Sanger and 200 for 454 and Ion Torrent.
The coverage must stay above the [-AS:ardct] threshold over at least this number of bases before the region is really treated as a repeat.
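The combination of the coverage threshold and the minimum length can be sketched as a scan over a per-base coverage profile. The function below is a hypothetical illustration, not MIRA's algorithm: it assumes that "repetitive" means the coverage reaches at least average coverage times ardct over an uninterrupted stretch of at least ardml bases.

```python
def has_repetitive_stretch(coverage, average_coverage, ardct=2.0, ardml=400):
    """Return True if coverage stays at or above the threshold for >= ardml bases.

    coverage:         per-base coverage values along the aligned positions
    average_coverage: average coverage for this sequencing technology
    """
    threshold = average_coverage * ardct
    run_length = 0
    for depth in coverage:
        # Count consecutive bases at or above the threshold; reset otherwise.
        run_length = run_length + 1 if depth >= threshold else 0
        if run_length >= ardml:
            return True
    return False


profile = [10] * 50 + [25] * 5 + [10] * 50
print(has_repetitive_stretch(profile, average_coverage=10, ardct=2.0, ardml=5))
```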
integer > 1
]
Default is dependent on the sequencing technology.
on|yes|1,
off|no|0
]
Default is currently yes for genome assemblies and no for EST assemblies or assemblies with Solexa data.
Takes effect only if uniform read distribution ([-AS:urd]) is on.
When set to yes, mira will analyse coverage of contigs built at a certain stage of the assembly and estimate an average expected coverage of reads for contigs. This value will be used in subsequent passes of the assembly to ensure that no part of the contig gets significantly more read coverage of reads that were previously identified as repetitive than the estimated average coverage allows for.
This switch is useful to disentangle repeats that are otherwise 100% identical and generally allows building larger contigs. It is expected to be useful for Sanger and 454 sequences. Usage of this switch with Solexa and Ion Torrent data is currently not recommended.
It is a real improvement to disentangle repeats, but has the side-effect of creating some "contig debris" (small and low coverage contigs, things you normally can safely throw away as they are representing sequence that already has enough coverage).
This switch must be set to no for EST assembly, assembly of transcripts etc. It is recommended to also switch this off for mapping assemblies.
integer > 0
]
Default is dependent on the sequencing technology and assembly quality level. Recommended values are: 3 for an assembly with 3 to 4 passes ([-AS:nop]). Assemblies with 5 passes or more should set the value to the number of passes minus 2.
Takes effect only if uniform read distribution ([-AS:urd]) is on.
float > 1.0
]
Default is 1.5 for all sequencing technologies in most assembly cases. The [--highlyrepetitive] quick-switch sets this to 1.2.
This option says this: if mira determined that the average coverage is $x$, then in subsequent passes it will allow coverage for reads determined to be repetitive to be built into the contig only up to a total coverage of $x*urdcm$. Reads that bring the coverage above the threshold will be rejected from that specific place in the contig (and either be built into another copy of the repeat somewhere else or end up as contig debris).
Please note that the lower [-AS:urdcm] is, the more contig debris you will end up with (contigs with an average coverage less than half of the expected coverage, mostly short contigs with just a couple of reads).
Takes effect only if uniform read distribution ([-AS:urd]) is on.
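As a worked example of the coverage multiplier: with an estimated average coverage of 20, the default of 1.5 caps repetitive reads at a total coverage of 30, while the 1.2 set by [--highlyrepetitive] caps them at 24. The sketch below uses hypothetical function names, and the assumption that each read adds one unit of coverage at a position is a simplification made for this example only.

```python
def max_repeat_coverage(average_coverage, urdcm=1.5):
    """Total coverage up to which repetitive reads may be built into a contig."""
    return average_coverage * urdcm


def accepts_repetitive_read(current_coverage, average_coverage, urdcm=1.5):
    """True if adding one more repetitive read stays within the coverage cap."""
    return current_coverage + 1 <= max_repeat_coverage(average_coverage, urdcm)


print(max_repeat_coverage(20))        # default multiplier
print(max_repeat_coverage(20, 1.2))   # multiplier set by --highlyrepetitive
```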
on|yes|1, off|no|0
]
Default is dependent on --job quality: currently no for draft and yes for accurate. Switched off for EST assembly.
Tells mira to keep repeats longer than the length of reads in separate contigs.
on|yes|1, off|no|0
]
Default is dependent on the sequencing technology and assembly quality level. A spoiler can be either a chimeric read or a read with long parts of unclipped vector sequence still included (that was too long for the [-CL:pvc] vector leftover clipping routines). A spoiler typically prevents contigs from being joined; MIRA will cut spoilers back so that they do no more harm to the assembly.
Recommended for mid- to high-coverage genomic assemblies, not recommended for assemblies of ESTs as one might lose splice variants with that.
A minimum number of two assembly passes ([-AS:nop]) must be run for this option to take effect.
on|yes|1, off|no|0
]
Default is yes. Defines whether the spoiler detection algorithms are run only for the last pass or for all passes ( [-AS:nop]).
Takes effect only if spoiler detection ([-AS:sd]) is on. If in doubt, leave it to 'yes'.
integer ≥ 20
]
Default is dependent on the sequencing technology. Defines the minimum length that reads must have to be considered for the assembly. Shorter sequences will be filtered out at the beginning of the process and won't be present in the final project.
integer ≥ 1
]
Default is dependent on the sequencing technology and the [--job] parameter. For genome assemblies it's usually around 2 for Sanger, 5 for 454, 5 for Ion Torrent, 5 for PacBio and 10 for Solexa. In EST assemblies, it's currently 2 for all sequencing technologies.
Defines the minimum number of reads a contig must have before it is built or saved by MIRA. Overlap clusters with fewer reads than defined will not be assembled into contigs; instead, the reads in these clusters will be immediately transferred to debris.
This parameter is useful to considerably reduce assembly time in large projects with millions of reads (like in Solexa projects) where a lot of small "junk" contigs with contamination sequence or otherwise uninteresting data may be created otherwise.
Note: a value larger than 1 for this parameter interferes with the functioning of [-OUT:sssip] and [-OUT:stsip].
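The effect of the minimum-reads threshold can be pictured with a short sketch. This is an illustration only, not MIRA's code: the function name `split_clusters` and the representation of overlap clusters as lists of read names are assumptions made for the example.

```python
def split_clusters(clusters, min_reads_per_contig):
    """Separate overlap clusters into those worth assembling and debris.

    clusters: list of lists of read names (hypothetical representation).
    Returns (clusters_to_assemble, debris_reads).
    """
    to_assemble = []
    debris = []
    for cluster in clusters:
        if len(cluster) >= min_reads_per_contig:
            to_assemble.append(cluster)
        else:
            # Too few reads: the reads of this cluster go straight to debris.
            debris.extend(cluster)
    return to_assemble, debris


clusters = [["r1", "r2", "r3", "r4", "r5"], ["r6"], ["r7", "r8"]]
kept, debris = split_clusters(clusters, min_reads_per_contig=5)
```

With a threshold of 5, only the first cluster would be built into a contig; the reads of the two small clusters end up as debris without any alignment work being spent on them.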
integer ≥ 0
]
Default is currently 10 for all sequencing technologies. Defines the default base quality of reads that have no quality read from file.
on|yes|1, off|no|0
]
Default is yes. When set to yes, MIRA will stop the assembly if any read has no quality values loaded.
on|yes|1, off|no|0
]
Default is yes. MIRA has two different pathfinder algorithms it chooses from to find its way through the (more or less) complete set of possible sequence overlaps: a genomic and an EST pathfinder. The genomic looks a bit into the future of the assembly and tries to stay on safe grounds using a maximum of information already present in the contig that is being built. The EST version on the contrary will directly jump at the complex cases posed by very similar repetitive sequences and try to solve those first and is willing to fall back to first-come-first-served when really bad cases (like, e.g., coverage with thousands of sequences) are encountered.
Generally, the genomic pathfinder will also work quite well with EST sequences (but might get slowed down a lot in pathological cases), while the EST algorithm does not work so well on genomes. If in doubt, leave on yes for genome projects and set to no for EST projects.
on|yes|1, off|no|0
]
Default is yes. Another important switch if you plan to assemble non-normalised EST libraries, where some ESTs may reach coverages of several hundreds or thousands of reads. This switch lets MIRA save a lot of computational time when aligning those extremely high coverage areas (but only there), at the expense of some accuracy.
integer > 0
]
Default is 500. Defines the number of potential partners a read must have for MIRA switching into emergency search stop mode for that read.
on|yes|1, off|no|0
]
Default is no. Defines whether there is an upper limit of time to be used to build one contig. Set this to yes in EST assemblies where you think that extremely high coverages occur. Less useful for assembly of genomic sequences.
integer > 0
]
Default is 10000. Depending on [-AS:umcbt] above, this number defines the time in seconds allocated to building one contig.
General options for controlling backbone options for mapping assemblies as well as general strain information.
on|yes|1, off|no|0
]
Default is no. Straindata is a key value file, one read per line: first the name of the read, then the strain name of the organism the read comes from. It is used by the program to distinguish between the different types of SNPs appearing in organisms and to classify them.
on|yes|1, off|no|0
]
Default is no for de-novo assemblies and yes for mapping.
Defines whether, after having loaded all data from all possible sources, MIRA will assign a strain name to reads which didn't get strain information via said data files (either NCBI TRACEINFO XML files or the simple MIRA straindata files). The strain name to assign is determined via [-SB:dsn] (see below).
string
]
Default is StrainX. Defines the strain name to assign to reads which don't have a strain name after loading, works only if [-SB:ads=yes] (see above).
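The interplay of straindata, [-SB:ads] and [-SB:dsn] can be sketched as follows. This is a minimal illustration, not MIRA's loader: the function names are hypothetical, and the parser assumes whitespace-separated fields with strain names that contain no blanks.

```python
def parse_straindata(text):
    """Parse straindata: one read per line, read name then strain name."""
    strains = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            strains[fields[0]] = fields[1]
    return strains


def assign_default_strain(read_names, strains, default_strain="StrainX"):
    """Give every read without strain information the default strain name."""
    return {name: strains.get(name, default_strain) for name in read_names}


straindata = "read1\tstrainA\nread2\tstrainB\n"
result = assign_default_strain(
    ["read1", "read2", "read3"], parse_straindata(straindata)
)
```

Here `read3` has no entry in the straindata and would therefore receive the default name, which corresponds to running with [-SB:ads=yes].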
on|yes|1, off|no|0
]
Default is no. A backbone is a sequence (or a previous assembly) that is used as template for a mapping assembly. The current assembly process will assemble reads first to those loaded backbone contigs before creating new contigs (if any).
This feature is helpful for assembling against previous (and already possibly edited) assembly iterations, or to make a comparative assembly of two very closely related organisms. Please read "very closely related" as in: only SNP mutations or short indels present.
0 < integer
]
Default is dependent on assembly quality level chosen: 0 for 'draft' and [-AS:nop] divided by 2 for 'accurate'.
When assembling against backbones, this parameter defines the pass iteration (see [-AS:nop]) from which on the backbones will be really used. In the passes preceding this number, the non-backbone reads will be assembled together as if no backbones existed. This allows mira to correctly spot repetitive stretches that differ by single bases and tag them accordingly. Note that full assemblies are considerably slower than mapping assemblies, so be careful with this when assembling millions of reads.
Rule of thumb: if backbones belong to the same strain as the reads to assemble, set to 1. If backbones are a different strain, then set [-SB:sbuip] to 1 lower than [-AS:nop] (example: nop=4 and sbuip=3).
string
]
Default is ReferenceStrain. Defines the name of the strain that the backbone sequences have.
on|yes|1, off|no|0
]
Default is no. Useful when using CAF as input for backbone: forces all reads of the backbone contigs to get assigned the new backbone strain, even if they previously had other strains assigned.
Main usage is in multi-step hybrid assemblies.
string
]
Default is an empty string. Useful when using CAF as input for backbone: when set to a given strain name, mira will internally use only reads from the given strain to build the rails it will use to align reads.
Main usage is in multi-step hybrid assemblies.
fasta, caf, gbf
]
Default is fasta. Defines the filetype of the backbone file given. Currently only FASTA, CAF and GBF files are supported.
When GBF (GenBank files, more commonly named '.gbk') files are loaded, the features within these files are automatically transformed into Staden compatible tags and get passed through the assembly.
0 ≤ integer ≤ 10000
]
Default is 0. Parameter for the internal sectioning size of the backbone to compute optimal alignments. Should be set to two times length of longest read in input data + 15%. When set to 0, MIRA will compute optimal values from the data loaded.
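As a worked example of the recommendation above, the following sketch computes a manual value from a list of read lengths. It reads "two times length of longest read + 15%" as 2 x longest x 1.15, which is an interpretation, and the function name is hypothetical. The upper limit of 10000 is the allowed maximum given for this parameter.

```python
def suggested_section_size(read_lengths, upper_limit=10000):
    """Twice the longest read plus 15%, rounded, capped at the allowed maximum."""
    longest = max(read_lengths)
    # Integer arithmetic: 2 * longest * 1.15, rounded to the nearest integer.
    size = (2 * longest * 115 + 50) // 100
    return min(size, upper_limit)


sanger_like = suggested_section_size([650, 720, 800])
short_reads = suggested_section_size([36, 36, 36])
```

For a longest read of 800 bases this gives 1840; for 36-base reads it gives 83. Leaving the parameter at 0 lets MIRA derive its own value from the loaded data instead.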
0 ≤ integer ≤ 2000
]
Default is 0. Parameter for the internal sectioning size of the backbone to compute optimal alignments. Should be set to length of the longest read. When set to 0, MIRA will compute optimal values from the data loaded.
-1 ≤ integer ≤ 100
]
Default is -1. Defines the default quality that the backbone sequences have if they came without quality values in their files (like in GBF format or when FASTA is used without .qual files). A value of -1 tells mira to use the same default quality for backbones as for reads.
on|yes|1, off|no|0
]
Default is no. Standard mapping assembly mode of the assembler is to map available reads to a backbone and discard reads that do not fit. If set to 'yes', mira will use reads that did not map to the backbone(s) to make new contigs (if possible). Please note: while a simple mapping assembly is comparatively cheap in terms of memory and time consumed, setting this option to 'yes' means that behind the scenes data for a full blown de-novo assembly is generated in addition to the data needed for a mapping assembly, which makes it a bit more costly than a de-novo assembly per se.
Options for controlling some data processing during the assembly.
on|yes|1, off|no|0
]
Default is dependent of the sequencing technology used: yes for Sanger, no for all others. mira expects the sequences it is given to be quality clipped. During the assembly though, it will try to extend reads into the clipped region and gain additional coverage by analysing Smith-Waterman alignments between reads that were found to be valid. Only the right clip is extended though, the left clip (most of the time containing sequencing vector) is never touched.
integer > 0
]
Default is dependent of the sequencing technology used. Only takes effect when [-DP:ure] (see above) is set to yes. The read extension routines use a sliding window approach on Smith-Waterman alignments. This parameter defines the window length.
integer > 0
]
Default is dependent on the sequencing technology used. Only takes effect when [-DP:ure] (see above) is set to yes. The read extension routines use a sliding window approach on Smith-Waterman alignments. This parameter defines the maximum number of errors (=disagreements) between two alignments in the given window.
integer ≥ 0
]
Default is dependent of the sequencing technology used. Only takes effect when [-DP:ure] (see above) is set to yes. The read extension routines can be called before assembly and/or after each assembly pass (see [-AS:nop]). This parameter defines the first pass in which the read extension routines are called. The default of 0 tells mira to extend the reads the first time before the first assembly pass.
integer ≥ 0
]
Default is dependent of the sequencing technology used. Only takes effect when [-DP:ure] (see above) is set to yes. The read extension routines can be called before assembly and/or after each assembly pass (see [-AS:nop]). This parameter defines the last pass in which the read extension routines are called. The default of 0 tells mira to extend the reads the last time before the first assembly pass.
Controls for clipping options: when and how sequences should be clipped.
Every option in this section can be set individually for every sequencing technology, giving a very fine grained control on how reads are clipped for each technology.
on|yes|1, off|no|0
]
Default is no. Uses the parameters [-CL:msvsgs:msvsmfg:msvsmeg] (see below).
Before running mira, the ssaha2 or smalt programs from the Sanger centre can be used to detect possible vector sequence stretches in the input data for the assembly. This parameter - if set to yes - will let mira load the result file of a ssaha2 or smalt run and tag the possible vector sequences at the ends of reads.
ssaha2 must be called like this: "ssaha2 <ssaha2options> vector.fasta sequences.fasta" to generate an output that can be parsed by mira. In the above example, replace vector.fasta by the name of the file with your vector sequences and sequences.fasta by the name of the file containing your sequencing data.
smalt must be called like this: "smalt map -f ssaha <ssaha2options> hash_index sequences.fasta"
This makes you basically independent from any other commercial or license-requiring vector screening software. For Sanger reads, a combination of lucy and ssaha2 or smalt together with this parameter should do the trick. For reads coming from 454 pyro-sequencing, ssaha2 or smalt and this parameter will also work very well. See the usage manual for a walkthrough example on how to use SSAHA2 / SMALT screening data.
Note 1: the output format of SSAHA2 must be the native output format (-output ssaha2). For SMALT, the output option -f ssaha must be used. Other formats cannot be parsed by MIRA.
Note 2: when using SSAHA2 results, the input file must be named <projectname>_ssaha2vectorscreen_in.txt. When using SMALT results, the input file must be named <projectname>_smaltvectorscreen_in.txt.
Note 3: if both a ssaha2 and a smalt result file are present, both will be read.
Note 4: I currently use the following SSAHA2 options: -kmer 8 -skip 1 -seeds 1 -score 12 -cmatch 9 -ckmer 6
Note 5: Anyone contributing SMALT parameters?
Note 6: the sequence vector clippings generated from SSAHA2 / SMALT data do not replace sequence vector clippings loaded via the EXP, CAF or XML files, they rather extend them.
integer ≥ 0
]
Default is dependent on the sequencing technology used. Takes effect only if [-CL:msvs] is yes. While performing the clip of screened vector sequences, mira will check whether it can merge larger chunks of sequencing vector bases that are a maximum of [-CL:msvsgs] apart.
integer ≥ 0
]
Default is dependent of the sequencing technology used. Takes effect only if [-CL:msvs] is yes. While performing the clip of screened vector sequences at the start of a sequence, mira will allow up to this number of non-vector bases in front of a vector stretch.
integer ≥ 0
]
Default is dependent of the sequencing technology used. Takes effect only if [-CL:msvs] is yes. While performing the clip of screened vector sequences at the end of a sequence, mira will allow up to this number of non-vector bases behind a vector stretch.
on|yes|1, off|no|0
]
Default is dependent of the sequencing technology used: yes for Sanger, no for any other. mira will try to identify possible sequencing vector relics present at the start of a sequence and clip them away. These relics are usually a few bases long and were not correctly removed from the sequence in data preprocessing steps of external programs.
You might want to turn off this option if you know (or think) that your data contains a lot of repeats and the option below to fine tune the clipping behaviour does not give the expected results.
You certainly want to turn off this option in EST assemblies as this will quite certainly cut back (and thus hide) different splice variants. But then make certain that your pre-processing of Sanger data (sequencing vector removal) is good; other sequencing technologies are not affected.
integer ≥ 0
]
Default is dependent on the sequencing technology used. The clipping of possible vector relics works quite well. Unfortunately, the bounds of repeats or differences in EST splice variants sometimes show the same alignment behaviour as possible sequencing vector relics and could therefore also be clipped.
To keep the vector clipping from mistakenly clipping repetitive regions or EST splice variants, this option puts an upper bound on the number of bases a potential clip is allowed to have. If the number of bases is below or equal to this threshold, the bases are clipped. If the number of bases exceeds the threshold, the clip is NOT performed.
Setting the value to 0 turns off the threshold, i.e., clips are then always performed if a potential vector was found.
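The threshold rule described above fits in a few lines. The sketch below only restates the decision in code; the function name `should_clip` and its arguments are hypothetical and do not correspond to anything in MIRA's interface.

```python
def should_clip(relic_length, max_clip_length):
    """Decide whether a potential vector relic of relic_length bases is clipped.

    max_clip_length == 0 switches the threshold off, so every detected
    relic is clipped.
    """
    if relic_length <= 0:
        return False  # nothing was detected
    if max_clip_length == 0:
        return True
    return relic_length <= max_clip_length


short_relic = should_clip(5, 18)
long_candidate = should_clip(30, 18)
no_threshold = should_clip(30, 0)
```

A 30-base candidate is left alone with a threshold of 18, which is the intended protection for repeat borders and splice variants, but it would be clipped when the threshold is switched off with 0.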
on|yes|1, off|no|0
]
Default is no. This will let mira perform its own quality clipping before sequences are entered into the assembly. The clip function performed is a sequence end window quality clip with back iteration to get a maximum number of bases as useful sequence. Note that the bases clipped away here can still be used afterwards if there is enough evidence supporting their correctness when the option [-DP:ure] is turned on.
Warning: The windowing algorithm works pretty well for Sanger, but apparently does not like 454 type data. It's advisable not to switch it on for 454. Besides, the 454 quality clipping algorithm performs a pretty decent albeit not perfect job, so for genomic 454 data (not! ESTs), it is currently recommended to use a combination of [-CL:emrc] and [-DP:ure].
integer ≥ 15 and ≤ 35
]
Default is dependent of the sequencing technology used. This is the minimum quality bases in a window require to be accepted. Please be cautious not to take too extreme values here, because then the clipping will be too lax or too harsh. Values below 15 and higher than 30-35 are not recommended.
integer ≥ 10
]
Default is dependent of the sequencing technology used. This is the length of a window in bases for the quality clip.
on|yes|1, off|no|0
]
Default is no. This option allows clipping reads that were not correctly preprocessed and have unclipped bad quality stretches that might prevent a good assembly.
mira will search the sequence in forward direction for a stretch of bases that have on average a quality less than a defined threshold and then set the right quality clip of this sequence to cover the given stretch.
integer ≥ 0
]
Default is dependent of the sequencing technology used. Defines the minimum average quality a given window of bases must have. If this quality is not reached, the sequence will be clipped at this position.
integer ≥ 0
]
Default is dependent on the sequencing technology used. Defines the length of the window within which the average quality of the bases is computed.
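The bad-stretch search can be sketched as a forward sliding window over the quality values. This is a simplified illustration under assumptions: the function name is hypothetical, the clip is placed at the start of the first failing window, and MIRA's actual routine may position the clip differently.

```python
def bad_stretch_right_clip(qualities, window_length, min_average_quality):
    """Return the index at which the right clip would be set.

    Scans forward for the first window whose average quality is below
    min_average_quality. Returns len(qualities) if no such window exists.
    """
    n = len(qualities)
    if window_length <= 0 or n < window_length:
        return n
    window_sum = sum(qualities[:window_length])
    for start in range(n - window_length + 1):
        if start > 0:
            # Slide the window one base to the right.
            window_sum += qualities[start + window_length - 1] - qualities[start - 1]
        if window_sum < min_average_quality * window_length:
            return start
    return n


quals = [30] * 10 + [5] * 5 + [30] * 5
clip = bad_stretch_right_clip(quals, window_length=5, min_average_quality=20)
```

In this example the first window averaging below 20 starts at index 8, so everything from there on would fall behind the right clip, even though good bases follow the bad stretch.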
on|yes|1, off|no|0
]
Default is dependent of the sequencing technology used. This will let mira perform a 'clipping' of bases that were masked out (replaced with the character X). It is generally not a good idea to use mask bases to remove unwanted portions of a sequence, the EXP file format and the NCBI traceinfo format have excellent possibilities to circumvent this. But because a lot of preprocessing software are built around cross_match, scylla- and phrap-style of base masking, the need arose for mira to be able to handle this, too. mira will look at the start and end of each sequence to see whether there are masked bases that should be 'clipped'.
integer ≥ 0
]
Default is dependent of the sequencing technology used. While performing the clip of masked bases, mira will look if it can merge larger chunks of masked bases that are a maximum of [-CL:mbcgs] apart.
integer ≥ 0
]
Default is dependent of the sequencing technology used. While performing the clip of masked bases at the start of a sequence, mira will allow up to this number of unmasked bases in front of a masked stretch.
integer ≥ 0
]
Default is dependent of the sequencing technology used. While performing the clip of masked bases at the end of a sequence, mira will allow up to this number of unmasked bases behind a masked stretch.
on|yes|1, off|no|0
]
Default is dependent of the sequencing technology used: on for 454 data, off for all others. This will let mira perform a 'clipping' of bases that are in lowercase at both ends of a sequence, leaving only the uppercase sequence. Useful when handling 454 data that does not have ancillary data in XML format.
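A minimal sketch of lowercase clipping, assuming the sequence is available as a plain string; the function name is hypothetical. It returns the left and right clip positions so that `sequence[left:right]` is the part that remains.

```python
def lowercase_end_clips(sequence):
    """Return (left, right) so that sequence[left:right] excludes lowercase ends."""
    left = 0
    while left < len(sequence) and sequence[left].islower():
        left += 1
    right = len(sequence)
    while right > left and sequence[right - 1].islower():
        right -= 1
    return left, right


read = "acgtACGTTGCAacg"
left, right = lowercase_end_clips(read)
kept = read[left:right]
```

Only the lowercase runs at the two ends are clipped in this sketch; lowercase bases inside the remaining sequence are not touched.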
on|yes|1, off|no|0
]
Default is no. This option is useful in EST assembly. Poly-A stretches in forward reads and poly-T stretches in reverse reads that were not correctly masked or clipped in preprocessing steps from external programs get clipped or tagged here. The assembler will not use these stretches for critical operations.
on|yes|1, off|no|0
]
Default is no. This option is currently not active (as of version 2.9.22).
In future, this will allow to keep the poly-A signal in the reads and tag them. The tags provide a good visual anchor when looking at the assembly with different programs.
integer > 0
]
Default is 10. Only takes effect when [-CL:cpat] (see above) is set to yes. Defines the number of "A" (in forward direction) or "T" (in reverse direction) that must be present for a stretch to be considered a poly-A signal stretch.
integer > 0
]
Default is 1. Only takes effect when [-CL:cpat] (see above) is set to yes. Defines the maximum number of errors allowed in the potential poly-A signal stretch. The distribution of these errors is not important.
integer > 0
]
Default is 9. Only takes effect when [-CL:cpat] (see above) is set to yes. Defines the number of bases from the end of a sequence (if masked: from the end of the masked area) within which a poly-A signal stretch is looked for.
on|yes|1, off|no|0
]
Default is dependent of the sequencing technology used. If on, ensures a minimum left clip on each read according to the parameters in [-CL:mlcr:smlc]
integer ≥ 0
]
Default is dependent on the sequencing technology used. If [-CL:emlc] is on, checks whether there is a left clip whose length is at least the one specified here.
integer ≥ 0
]
Default is dependent of the sequencing technology used. If [-CL:emlc] is on and actual left clip is < [-CL:mlcr], set left clip of read to the value given here.
on|yes|1, off|no|0
]
Default is dependent of the sequencing technology used. If on, ensures a minimum right clip on each read according to the parameters in [-CL:mrcr:smrc]
integer ≥ 0
]
Default is dependent on the sequencing technology used. If [-CL:emrc] is on, checks whether there is a right clip whose length is at least the one specified here.
integer ≥ 0
]
Default is dependent of the sequencing technology used. If [-CL:emrc] is on and actual right clip is < [-CL:mrcr], set the length of the right clip of read to the value given here.
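The "ensure minimum clip" logic is the same for both read ends and can be sketched with one helper. The function name is hypothetical; `min_required` stands for [-CL:mlcr] or [-CL:mrcr] and `set_to` for [-CL:smlc] or [-CL:smrc], with clip lengths counted in bases from the respective read end.

```python
def ensure_min_clip(clip_length, min_required, set_to):
    """Return the clip length after enforcing the minimum.

    If the existing clip is shorter than min_required, it is replaced by
    set_to. Otherwise the existing clip is kept unchanged.
    """
    if clip_length < min_required:
        return set_to
    return clip_length


unclipped_read = ensure_min_clip(0, min_required=4, set_to=4)
already_clipped = ensure_min_clip(10, min_required=4, set_to=4)
```

A read that arrives without any clip gets the configured clip, while a read that already has a longer clip is left as it is.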
on|yes|1, off|no|0
]
Default is yes for [--job=genome] assemblies and no for [--job=est] assemblies.
The SKIM routines of MIRA can be also used without much time overhead to find chimeric reads. When this parameter is set, MIRA will use that info to cut back chimeras to their longest non-chimeric length.
Warning: when working on low coverage data (e.g. < 5 to 6x Sanger and < 10x 454 or 10x Ion Torrent), you may want to switch off this option if you try to go for the longest contigs. Reason: single reads joining otherwise disjoint contigs will probably be categorised as chimeras.
on|yes|1, off|no|0
]
Default is currently no.
The SKIM routines of MIRA can be also used without much time overhead to find junk sequence at end of reads. When this parameter is set, MIRA will use that info to cut back junk in reads.
It is currently suggested to leave this parameter switched off as the routines seem to be a bit too "trigger happy" and also cut back perfectly valid sequences.
on|yes|1, off|no|0
]
Default is dependent on --job quality: currently yes for all genome assemblies. Switched off for EST assemblies (but one might want to switch it on sometimes).
This implements a pretty powerful strategy to ensure a good "high confidence region" (HCR) in reads, basically eliminating 99.9% of all junk at the 5' and 3' ends of reads. Note that one still must ensure that sequencing vectors (Sanger) or adaptor sequences (454, Solexa, Ion Torrent) are "more or less" clipped prior to assembly.
Warning: extremely effective, but should NOT be used for very low coverage genomic data, for EST projects or if one wants to retain rare transcripts.
on|yes|1, off|no|0
]
Default is yes.
Solexa data has a pretty awful problem in some reads when a GGCxG motif occurs (read more about it in the chapter on Solexa data). In short: the sequencing errors produced by this problem lead to many false positive SNP discoveries in mapping assemblies or problems in contig building in de-novo assembly.
MIRA knows about this problem and can look for it in Solexa reads during the proposed end clipping and further clip back the reads, greatly minimising the impact of this problem.
integer ≥ 10
]
Default is dependent on technology and quality in the --job switch: usually between 17 and 21 for Sanger, higher for 454 (up to 27) and highest for Solexa (31). Ion Torrent is at 17 at the moment, but this may change in the future to somewhat higher values.
This parameter defines the minimum number of bases at each end of a read that should be free of any sequencing errors. Note that the algorithm is based on SKIM hashing (see below), and compares hashes of all reads with each other. Therefore, using values less than 12 will lead to false negative hits.
Options that control the behaviour of the initial fast all-against-all read comparison algorithm. Matches found here will be confirmed later in the alignment phase. The new SKIM3 algorithm that is in place since version 2.7.4 uses a hash based algorithm that works similarly to SSAHA (see Ning Z, Cox AJ, Mullikin JC; "SSAHA: a fast search method for large DNA databases."; Genome Res. 2001;11;1725-9).
The major differences of SKIM3 and SSAHA are:
the word length n of a hash can be up to 31 bases (in 64 bit versions of MIRA)
SKIM3 uses a maximum fixed amount of RAM that is independent of the word size. E.g., SSAHA would need 4 exabyte to work with word length of 30 bases ... SKIM3 just takes a couple of hundred MB.
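The 4 exabyte figure can be checked with a little arithmetic. The sketch assumes a direct lookup table with one entry per possible word and 4 bytes per entry; the entry size is an assumption chosen for illustration, not a statement about SSAHA's actual data structures.

```python
def direct_table_bytes(word_length, bytes_per_entry=4):
    """Size of a table holding one entry for each of the 4**n possible words."""
    return (4 ** word_length) * bytes_per_entry


size_30 = direct_table_bytes(30)
exbibytes_30 = size_30 / 2 ** 60
mebibytes_12 = direct_table_bytes(12) / 2 ** 20
```

With a word length of 30 there are 4^30 = 2^60 possible words, so the table would occupy 2^62 bytes, i.e. 4 exbibytes. For a word length of 12 the same table needs only 64 MiB, which shows how quickly the size grows with the word length.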
The parameters for SKIM3:
integer ≥ 1
]
Number of threads used in SKIM, default is 2. A few parts of SKIM are non-threaded, so the speedup is not exactly linear, but it should be very close. E.g., with 2 processors I get a speedup of 180-195%, with 4 between 350 and 395%.
Although the main data structures are shared between the threads, there's some additional memory needed for each thread.
on|yes|1, off|no|0
]
Default is on. Defines whether SKIM searches for matches only in forward/forward direction or whether it also looks for forward/reverse direction.
You usually will not want to touch the default, except for very special application cases where you do not want MIRA to use reverse complement sequences at all.
10 < integer ≤ 32
]
Controls the number of consecutive bases $n$ which are used as a word hash. The higher the value, the faster the search. The lower the value, the more weak matches are found. Values below 10 are not recommended. Defaults are dependent on the "--job" switch.
integer ≥ 1
]
Default is 1. This is a parameter controlling the stepping increment $s$ with which hashes are generated. This allows for more or less fine grained search as matches are found with at least $n+s$ (see [-SK:bph]) equal bases. The higher the value, the faster the search. The lower the value, the more weak matches are found.
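How the word length $n$ and the stepping increment $s$ interact can be seen by listing the words that would be sampled from a sequence. This is only a sketch of the sampling idea with a hypothetical function name; it does not reproduce SKIM3's hashing or say on which side of a comparison the stepping is applied.

```python
def sampled_words(sequence, word_length, step=1):
    """Return (position, word) pairs for words taken every `step` bases."""
    last_start = len(sequence) - word_length
    return [
        (pos, sequence[pos:pos + word_length])
        for pos in range(0, last_start + 1, step)
    ]


all_words = sampled_words("ACGTACGTAC", word_length=4, step=1)
every_second = sampled_words("ACGTACGTAC", word_length=4, step=2)
```

With a step of 1 the ten-base sequence yields seven words; with a step of 2 only four are generated, which is why larger steps make the search faster but let short matches slip through.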
integer ≥ 1
]
Default is dependent of the sequencing technology used and assembly quality wished. Controls the relative percentage of exact word matches in an approximate overlap that has to be reached to accept this overlap as possible match. Increasing this number will decrease the number of possible alignments that have to be checked by Smith-Waterman later on in the assembly, but it also might lead to the rejection of weaker overlaps (i.e. overlaps that contain a higher number of mismatches).
Note: most of the time it makes sense to keep this parameter in sync with [-AL:mrs].
integer ≥ 1
]
Default is 2000. Controls the maximum number of possible hits one read can maximally transport to the graph edge reduction phase. If more potential hits are found, only the best ones are taken.
In the pre-2.9.x series, this was an important option for tackling projects which contain extreme assembly conditions. It still is if you run out of memory in the graph edge reduction phase. Try then to lower it to 1000, 500 or even 100.
As the assembly increases in passes ([-AS:nop]), different combinations of possible hits will be checked, always the probably best ones first. So the accuracy of the assembly should only suffer when lowering this number too much.
on|yes|1, off|no|0
]
Default is currently (3.4.0) yes for accurate mapping jobs. Takes effect only in mapping assemblies. Defines whether SKIM hits against a backbone (reference) sequence with less than 100% identity are double checked with Smith-Waterman to improve mapping accuracy.
You will want to set this option to yes whenever your reference sequence contains more complex or numerous repeats and your data has SNPs in those areas.
float > 0
]
During SKIM analysis, MIRA will estimate how repetitive parts of reads are. Parts which occur less than [-SK:fenn] times the average occurrence will be tagged with a HAF2 (less than average) tag.
float > 0
]
During SKIM analysis, MIRA will estimate how repetitive parts of reads are. Parts which occur more than [-SK:fenn] but less than [-SK:fexn] times the average occurrence will be tagged with a HAF3 (normal) tag.
float > 0
]
During SKIM analysis, MIRA will estimate how repetitive parts of reads are. Parts which occur more than [-SK:fexn] but less than [-SK:fer] times the average occurrence will be tagged with a HAF4 (above average) tag.
float > 0
]
During SKIM analysis, MIRA will estimate how repetitive parts of reads are. Parts which occur more than [-SK:fer] but less than [-SK:fehr] times the average occurrence will be tagged with a HAF5 (repeat) tag.
float > 0
]
During SKIM analysis, MIRA will estimate how repetitive parts of reads are. Parts which occur more than [-SK:fehr] but less than [-SK:fecr] times the average occurrence will be tagged with a HAF6 (heavy repeat) tag. Parts which occur more than [-SK:fecr] but less than [-SK:nrr] times the average occurrence will be tagged with a HAF7 (crazy repeat) tag.
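The five thresholds partition the occurrence ratio into the HAF classes. The sketch below restates that partition; it is not MIRA's code. The function name is hypothetical, the threshold values in the example are illustrative numbers rather than documented defaults, and the handling of ratios exactly equal to a threshold is an assumption because the text does not specify it.

```python
def haf_tag(ratio, fenn, fexn, fer, fehr, fecr):
    """Map an occurrence ratio (occurrences / average occurrence) to a HAF tag."""
    if ratio < fenn:
        return "HAF2"  # less than average
    if ratio < fexn:
        return "HAF3"  # normal
    if ratio < fer:
        return "HAF4"  # above average
    if ratio < fehr:
        return "HAF5"  # repeat
    if ratio < fecr:
        return "HAF6"  # heavy repeat
    # Everything above fecr; ratios beyond [-SK:nrr] are handled by masking.
    return "HAF7"      # crazy repeat


# Illustrative thresholds only, chosen for the example.
thresholds = dict(fenn=0.4, fexn=1.6, fer=1.9, fehr=8.0, fecr=20.0)
tags = [haf_tag(r, **thresholds) for r in (0.2, 1.0, 1.7, 5.0, 10.0, 50.0)]
```

The thresholds must increase from fenn to fecr for the partition to make sense; the sketch does not check this.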
on|yes|1, off|no|0
]
Default is dependent on --job type: yes for de-novo, no for mapping. Tells mira to mask during the SKIM phase subsequences of size [-SK:nph] nucleotides that appear more often than the median occurrence of subsequences would otherwise suggest. The threshold from which on subsequences are considered nasty is set by [-SK:nrr] (see below).
There's one drawback though: the smaller the reads are that you try to assemble with this option turned on, the higher the probability that your reads will not span nasty repeats completely, therefore leading to an abortion of contig building at this site.
The masked parts are tagged with "MNRr" in the reads.
This option is extremely useful for assembly of larger projects (fungi-size) with a high percentage of repeats, or in non-normalised EST projects, to get at least something assembled.
Although it is expected that bacteria will not really need this, leaving it turned on will probably not harm except in unusual cases like several copies of (pro-)phages integrated in a genome.
integer ≥ 2
]
Default is depending on the [--job=...] parameters. Normally it's high (around 100) for genome assemblies, but much lower (20 or less) for EST assemblies.
Sets the ratio from which on subsequences are considered nasty and hidden from the SKIM overlapper with a MNRr tag. The value of 10 means: mask all k-mers of [-SK:bph] length which are occurring more than 10 times more often than the average of the project.
integer; 0, 5-8
]
Default is 6. Sets the minimum level of the HAF tags from which on MIRA will report tentatively repetitive sequence in the *_info_readrepeats.lst file of the info directory.
A value of 0 means "switched off". The default value of 6 means all subsequences tagged with HAF6, HAF7 and MNRr will be logged. If you, e.g., only wanted MNRr logged, you'd use 8 as parameter value.
See also [-SK:fenn:fexn:fer:fehr:mnr:nrr] to set the different levels for the HAF and MNRr tags.
integer ≥ 0
]
Default is 0. If the number of reads identified as megahubs exceeds the allowed ratio, mira will abort.
This is a fail-safe parameter to avoid assemblies where things look fishy. In case you see this, you might want to ask for advice on the mira_talk mailing list. In short: bacteria should never have megahubs (90% of all cases reported were contamination of some sort and the remaining 10% were due to incredibly high coverage numbers). Eukaryotes are likely to contain megahubs if filtering ([-SK:mnr]) is not on.
EST projects, however, especially from non-normalised libraries, will very probably contain megahubs. In this case, you might want to think about masking, see [-SK:mnr].
integer ≥ 100000
]
Default is 15000000. Has no influence on the quality of the assembly, only on the maximum memory size needed during the skimming. The default value is equivalent to approximately 500MB.
Note: reducing the number will increase the run time, the more drastically the bigger the reduction. On the other hand, increasing the default value chosen will not result in speed improvements that are really noticeable. In short: leave this number alone if you are not desperate to save a few MB.
integer ≥ 10
]
Default is 1024, 2048 when Solexa sequences are used. Maximum memory used (in MiB) during the reduction of skim hits.
Note: has no influence on the quality of the assembly, reducing the number will increase the runtime, the more drastically the bigger the reduction as hits then must be streamed multiple times from disk.
The default is good enough for assembly of bacterial genomes or small eukaryotes (using Sanger and/or 454 sequences). As soon as assembling something bigger than 20 megabases, you should increase it to 2048 or 4096 (equivalent to 2 or 4 GiB of memory).
The align options control the behaviour of the Smith-Waterman alignment routines. Only read pairs which are confirmed here may be included into contigs. Affects both the checking of possible alignments found by SKIM as well as the phase when reads are integrated into a contig.
Every option in this section can be set individually for every sequencing technology, giving a very fine grained control on how reads are aligned for each technology.
integer > 0 and ≤100
]
Default is dependent on the sequencing technology used. The banded Smith-Waterman alignment uses this percentage number to compute the bandwidth it has to use when computing the alignment matrix. E.g., expected overlap is 150 bases, bip=10 -> the banded SW will compute a band of 15 bases to each side of the expected alignment diagonal, thus allowing up to 15 unbalanced inserts / deletes in the alignment. Increasing this number will find more non-optimal alignments, but will also increase SW runtime somewhere between linearly and quadratically. Decreasing it works the other way round: a few bad alignments might be missed, but speed is gained.
integer > 0
]
Default is dependent on the sequencing technology used. Minimum bandwidth in bases to each side.
integer > 0
]
Default is dependent on the sequencing technology used. Maximum bandwidth in bases to each side.
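The interplay of the band percentage with the minimum and maximum bandwidth can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above: the function name, the integer rounding and the clamping order are assumptions, not MIRA's actual implementation.

```python
def band_per_side(expected_overlap, band_percent, band_min, band_max):
    """Bandwidth in bases to each side of the expected alignment diagonal.

    Hypothetical helper: rounding down and clamping to [band_min, band_max]
    are assumptions made for this sketch.
    """
    band = expected_overlap * band_percent // 100  # percentage of the overlap
    return max(band_min, min(band, band_max))      # keep within min / max


# The example from the text: 150 bases expected overlap, 10 percent.
print(band_per_side(150, 10, band_min=1, band_max=100))  # -> 15
```

With a minimum bandwidth of 20, the same overlap would be aligned with a band of 20 bases to each side, as the computed 15 falls below the minimum.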
integer > 0
]
Default is dependent on the sequencing technology used. Minimum number of overlapping bases needed in an alignment of two sequences to be accepted.
integer > 0
]
Default is dependent on the sequencing technology used. Describes the minimum score of an overlap to be taken into account for assembly. mira uses a default scoring scheme for SW align: each match counts 1, a match with an N counts 0, each mismatch with a non-N base counts -1 and each gap counts -2. Use a higher score to weed out a number of chance matches, or a lower score to perhaps find the single (short) alignment that might join two contigs together (at the expense of computing time and memory).
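As an illustration of the scoring scheme, the following Python sketch scores two already aligned sequences of equal length. The function name and the use of '*' as the gap character are assumptions made for the example; the sketch does not reproduce MIRA's alignment code, only the per-column scores given above.

```python
def overlap_score(aligned_a, aligned_b, gap="*"):
    """Score two aligned sequences: match +1, N 0, mismatch -1, gap -2.

    Hypothetical helper; both inputs must already be aligned (same length).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            score -= 2          # gap column
        elif x == "N" or y == "N":
            score += 0          # a match with an N counts 0
        elif x == y:
            score += 1          # match
        else:
            score -= 1          # mismatch between two non-N bases
    return score


print(overlap_score("ACGTN", "ACCTA"))  # three matches, one mismatch, one N -> 2
```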
integer > 0 and ≤100
]
Default is dependent on the sequencing technology used. Describes the minimum percentage of matching between two reads to be considered for assembly. Increasing this number will save memory, but one might lose possible alignments. I propose a maximum of 80 here. Decreasing below 55% will probably make memory and time consumption explode.
Note: most of the time it makes sense to keep this parameter in sync with [-SK:pr].
on|yes|1, off|no|0
]
Default is dependent on the sequencing technology used. Defines whether or not to increase penalties applied to alignments containing long gaps. Setting this to 'yes' might help in projects with frequent repeats. On the other hand, it is definitely disturbing when assembling very long reads containing multiple long indels in the called base sequence ... although this should not happen in the first place and is a sure sign of problems lying ahead.
When in doubt, set it to yes for EST projects and de-novo genome assembly, set it to no for assembly of closely related strains (assembly against a backbone).
When set to no, it is recommended to have [-CO:amgb] and [-CO:amgbemc] both set to yes.
low|0, medium|1, high|2, split_on_codongaps|10
]
Default is dependent on the sequencing technology used. Has no effect if extra_gap_penalty is off. Defines an extra penalty applied to 'long' gaps. There are three predefined levels: low - use this if you expect your base caller frequently misses 2 or more bases. medium - use this if your base caller is expected to frequently miss 1 to 2 bases. high - use this if your base caller does not frequently miss more than 1 base.
For some stages of the EST assembly process, a special value split_on_codongaps is used. It is even a tick harsher than the 'high' level.
Also, usage of this parameter is probably a good thing if the repeat marker of the contig is set to not mark on gap bases ([-CO:amgb] is set to no). This is generally the case for 454 data.
0 ≤ integer ≤ 100
]
Default is 100. Has no effect if extra_gap_penalty is off. Defines the maximum extra penalty in percent applied to 'long' gaps.
The contig options control the behaviour of the contig objects.
string
]
Default is <projectname>. Contigs will have this string prepended to their names. The [-project=] quick-switch will also change this option.
integer > 0 and ≤100
]
Default is dependent on the sequencing technology used. When adding reads to a contig, reject the reads if the drop in the quality of the consensus is greater than the given value in %. Lower values mean stricter checking. This value is doubled should a read be entered that has a template partner (a read pair) at the right distance.
on|yes|1, off|no|0
]
Default is yes. One of the most important switches in MIRA: if set to yes, MIRA will try to resolve misassemblies due to repeats by identifying single base stretch differences and tag those critical bases as RMB (Repeat Marker Base, weak or strong). This switch is also needed when MIRA is run in EST mode to identify possible inter-, intra- and intra-and-interorganism SNPs.
on|yes|1, off|no|0
]
Default is no. Only takes effect when [-CO:mr] (see above) is set to yes. If set to yes, MIRA will not use the repeat resolving algorithm during build time (and therefore will not be able to take advantage of this), but only before saving results to disk.
This switch is useful in some (rare) cases of mapping assembly.
on|yes|1, off|no|0
]
Default is no. Only takes effect when [-CO:mr] (see above) is set to yes. The effect also depends on whether strain data (see [-SB:lsd]) is present or not. Usually, mira will mark bases that differentiate between repeats when a conflict occurs between reads that belong to one strain. If the conflict occurs between reads belonging to different strains, they are marked as SNP. However, if this switch is set to yes, conflicts within a strain are also marked as SNP.
This switch is mainly used in assemblies of ESTs, it should not be set for genomic assembly.
integer ≥ 2
]
Default is dependent on the sequencing technology used. Only takes effect when [-CO:mr] (see above) is set to yes. This defines the minimum number of reads in a group that are needed for the RMB (Repeat Marker Bases) or SNP detection routines to be triggered. A group is defined by the reads carrying the same nucleotide for a given position, i.e., an assembly with mrpg=2 will need at least two times two reads with the same nucleotide (having at least a quality as defined in [-CO:mgqrt]) to be recognised as repeat marker or a SNP. Setting this to a low number increases sensitivity, but might produce a few false positives, resulting in reads being thrown out of contigs because of falsely identified possible repeat markers (or wrongly recognised as SNP).
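The group criterion can be illustrated with a small Python sketch for a single alignment column. The function name is hypothetical, and applying the quality threshold to each base individually is a simplification assumed for this example; the real routines evaluate group qualities and further criteria.

```python
from collections import Counter


def column_has_competing_groups(bases, quals, mrpg=2, min_qual=30):
    """True if at least two nucleotide groups each contain >= mrpg reads.

    Hypothetical helper: bases and quals are the per-read base and quality
    at one column; per-base quality filtering is an assumption of this sketch.
    """
    groups = Counter(b for b, q in zip(bases, quals) if q >= min_qual)
    large_groups = [b for b, n in groups.items() if n >= mrpg]
    return len(large_groups) >= 2


# Three reads with A and two reads with C, all of good quality:
print(column_has_competing_groups("AAACC", [40, 40, 40, 40, 40]))  # -> True
```

A column with four A and a single C does not qualify with mrpg=2, as the lone C does not form a group.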
integer ≥ 10
]
Default is dependent on the sequencing technology used. Only takes effect when [-CO:mr] is set to yes. This defines the minimum quality of neighbouring bases that a base must have for being taken into consideration during the decision whether column base mismatches are relevant or not.
integer ≥ 25
]
Default is dependent on the sequencing technology used. Only takes effect when [-CO:mr] is set to yes. This defines the minimum quality of a group of bases to be taken into account as potential repeat marker. The lower the number, the more sensitive you get, but lowering below 25 is not recommended as a lot of wrongly called bases can have a quality approaching this value and you'd end up with a lot of false positives. The higher the overall coverage of your project, the better, and the higher you can set this number. A value of 35 will probably remove most false positives, a value of 40 will probably never show false positives ... but will generate a sizable number of false negatives.
integer ≥ 0
]
Default is dependent on the sequencing technology used. Only takes effect when [-CO:mr] is set to yes. Using the ends of sequences from Sanger type shotgun sequencing is always a bit risky, as wrongly called bases tend to crowd there or some sequencing vector relics hang around. It is even more risky to use these stretches for detecting possible repeats, so one can define an exclusion area where the bases are not used when determining whether a mismatch is due to repeats or not.
on|yes|1, off|no|0
]
Default is yes. When [-CL:pec] is set, the end-read exclusion area can be considerably reduced. Setting this parameter will automatically do this.
Note: although the parameter is named "set to 1", it may be that the exclusion area is actually a bit larger (2 to 4), depending on what users will report back as "best" option.
on|yes|1, off|no|0
]
Default is dependent on the sequencing technology used. Determines whether columns containing gap bases (indels) are also tagged.
Note: it is strongly recommended to not set this to 'yes' for 454 type data.
on|yes|1, off|no|0
]
Default is yes. Takes effect only when [-CO:amgb] is set to yes. Determines whether multiple columns containing gap bases (indels) are also tagged.
on|yes|1, off|no|0
]
Default is yes. Takes effect only when [-CO:amgb] is set to yes. Determines whether, for tagging columns containing gap bases, both strands need to have a gap. Setting this to no is not recommended except when working in desperately low coverage situations.
on|yes|1, off|no|0
]
Default is no for all sequencing types. If set to yes, mira will be forced to make a choice for a consensus base (A, C, G, T or gap) even in unclear cases where it would normally put an IUPAC base. All other things being equal (like quality of the possible consensus base and other things), mira will choose a base by either looking for a majority vote or, if that also is not clear, by preferring gaps over T over G over C over finally A.
mira makes a considerable effort to deduce the right base at each position of an assembly. Only when cases begin to be borderline will it use an IUPAC code to make you aware of potential problems. It is suggested to leave this option set to no, as IUPAC bases in the consensus are a sign that - if you need 100% reliability - you really should have a look at this particular place to resolve potential problems. You might want to set this parameter to yes in the following cases: 1) when the tools that use your assembly results cannot handle IUPAC bases and you don't care about being absolutely perfect in your data (by looking over them manually). 2) when you assemble data without any quality values (which you should not do anyway), then this method will allow you to get a result without IUPAC bases that is "good enough" with respect to the fact that you did not have quality values.
Important note: in case you are working with a hybrid assembly, mira will still use IUPAC bases at places where reads from different sequencing types contradict each other. In fact, when not forcing non-IUPAC bases for hybrid assemblies, the overall consensus will be better and probably have fewer IUPAC bases as mira can make better use of available information.
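The fallback order for forced base calls (majority vote first, then gap over T over G over C over A) can be sketched as follows. This Python example is a simplified illustration: the function name is hypothetical, '*' is assumed as the gap character, and base qualities, which mira evaluates before resorting to this rule, are ignored entirely.

```python
PREFERENCE = "*TGCA"  # gap over T over G over C over A


def forced_consensus_base(counts):
    """Pick one of A, C, G, T or '*' from per-base read counts.

    Hypothetical helper: majority vote, ties broken by PREFERENCE.
    Qualities are deliberately left out of this sketch.
    """
    best = max(counts.values())
    tied = [base for base, n in counts.items() if n == best]
    return min(tied, key=PREFERENCE.index)  # earliest in PREFERENCE wins


print(forced_consensus_base({"A": 2, "T": 2}))  # tie, T is preferred -> T
```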
on|yes|1, off|no|0
]
Default is yes for all Solexas when in a mapping assembly, else it's no. Can only be used in mapping assemblies. If set to yes, MIRA will merge all perfectly mapping Solexa reads into longer reads (Coverage Equivalent Reads, CERs) while keeping quality and coverage information intact.
This feature hugely reduces the number of Solexa reads and makes assembly results with Solexa data small enough to be handled by current finishing programs (gap4, consed, others) on normal workstations.
-1, integer > 0
]
Default is -1 for all Solexas when in a mapping assembly. Takes only effect in mapping assemblies if [-CO:msr=yes] and for reads which have a paired-end / mate-pair partner actively used in the assembly.
If set to a value > 0, MIRA will not merge paired-end / mate-pair reads if they map within the given distance of a contig end of the original reference sequence (backbone). Instead of a fixed value, one can also use -1. MIRA will then automatically not merge reads if the distance from the contig end is within the maximum size of the template insert size of the sequencing library for that read (either given via [-GE:tismax] or via XML TRACEINFO for the given read).
This feature allows one to use the data reduction from [-CO:msr] while keeping the result of such a mapping useful for subsequent scaffolding programs that order contigs.
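The decision described above can be summarised in a short Python sketch. The function and parameter names are hypothetical, and treating a read exactly at the limit as "within the distance" is an assumption; the sketch only restates the rule for a single paired read.

```python
def may_merge_paired_read(dist_to_contig_end, msd, max_insert_size):
    """Decide whether a paired read far enough from a contig end may be merged.

    Hypothetical helper: msd > 0 is used as a fixed distance, msd == -1 falls
    back to the maximum template insert size of the read's library.
    """
    limit = max_insert_size if msd == -1 else msd
    return dist_to_contig_end > limit  # reads within the limit stay unmerged


# With msd=-1 and a library of at most 500 bases insert size:
print(may_merge_paired_read(300, -1, 500))   # -> False, too close to the end
print(may_merge_paired_read(2000, -1, 500))  # -> True
```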
General options for controlling the integrated automatic editor. The editors generally do a good job cleaning up alignments from typical sequencing errors (like base overcalls etc.). However, they may prove tricky in certain situations:
in EST assemblies, they may edit rare transcripts toward almost identical, more abundant transcripts. Usage must be carefully weighed.
the editors will not only change bases, but also sometimes delete or insert non-gap bases as needed to improve an alignment when facts (trace signals or other) show that this is what should have been the sequence. However, this can make post processing of assembly results pretty difficult with some formats like ACE, where the format itself contains no way to specify certain edits like deletion. There's nothing one can do about it and the only way to get around this problem is to use file formats with more complete specifications like CAF, MAF (and BAF once supported by MIRA).
The following edit parameters are supported:
on|yes|1, off|no|0
]
Default is no. Once contigs have been built, mira can call built-in versions of the automatic contig editors. For Sanger reads this is EdIt, for 454 reads it is a specially crafted editor that knows about deficiencies of the 454 technology (homopolymers).
EdIt will try to resolve discrepancies in the contig by performing trace analysis and correct even hard to resolve errors. This option is always useful, but especially in conjunction with [-AS:nop] and [-DP:ure] (see above).
Notice 1: the current development version has a memory leak in the editor, therefore the option is not automatically turned on.
Notice 2: it is strongly suggested to turn this option on for 454 data as this greatly improves the quality.
on|yes|1, off|no|0
]
Default is yes. Only for Sanger data. If set to yes, the automatic editor will not take error hypotheses with a low probability into account, even if all the requirements to make an edit are fulfilled.
integer, 0 < x ≤ 100
]
Default is 50. Only for Sanger data. The higher this value, the more strict the automatic editor will apply its internal rule set. Going below 40 is not recommended.
Options which would not fit elsewhere.
on|yes|1, off|no|0
]
Default is yes. MIRA will check whether the tmp directory is running on an NFS mount. If it is and [-MI:sonfs] is active, MIRA will stop with a warning message.
Warning: you should never ever at all run MIRA on an NFS mounted directory ... or face the fact that the assembly process may very well take 5 to 10 times longer (or more) than normal. You have been warned. The reason for the slowdown is the same as why one should never run a BLAST search on a big database located on an NFS volume: access via network is terribly slow when compared to local disks, at least if you have not invested a lot of money into specialised solutions.
integer > 0
]
Default is 500. This parameter has absolutely no influence whatsoever on the assembly process of MIRA, but it is used in the reporting within the *_assembly_info.txt file after the assembly, where MIRA reports statistics on large contigs and on all contigs. [-MI:lcs] is the threshold value for categorising contigs.
integer > 0
]
Default is 5000 for [--job=genome] and 1000 for [--job=est].
This parameter is used for internal statistics calculations and has a subtle influence in the [--job=genome] assembly mode.
MIRA uses coverage information of an assembly project to find out about potentially repetitive areas in reads (and thus, a genome). To calculate statistics which are reflecting the approximate truth, the value of [-MI:lcs4s] is used as a cutoff threshold: contigs smaller than this value do not contribute to the calculation of average coverage while contigs larger or equal to this value do. Having this cutoff discards small contigs which tend to muddy the picture of average coverage of a project.
If in doubt, don't touch this parameter.
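The effect of the cutoff on the average coverage can be illustrated with a small Python sketch. The function name is hypothetical and the length-weighted mean is an assumption; the text above only states which contigs contribute, not how MIRA weights them.

```python
def average_coverage(contigs, lcs4s=5000):
    """Average coverage over contigs with length >= lcs4s.

    Hypothetical helper: contigs is a list of (length, coverage) tuples and
    the mean is weighted by contig length (an assumption of this sketch).
    Returns None if no contig reaches the cutoff.
    """
    large = [(length, cov) for length, cov in contigs if length >= lcs4s]
    if not large:
        return None
    total_length = sum(length for length, _ in large)
    return sum(length * cov for length, cov in large) / total_length


# The short, highly covered contig (e.g. a collapsed repeat) is ignored.
contigs = [(10000, 20.0), (5000, 30.0), (800, 200.0)]
print(round(average_coverage(contigs), 2))  # -> 23.33
```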
General options for controlling where to find or where to write data.
<directoryname>
]
Default is an empty string. When encountered during parameter parsing, MIRA will change the working directory immediately to the directory given and read and write files there.
Therefore, a call like mira -DI:cwd=/somedir --params=myparameters.txt will be enough to let MIRA change to the directory /somedir and then read further parameters from a text file myparameters.txt (which should be present there) and at the same time have all the input and output of the assembly occurring in directory /somedir.
<directoryname>
]
Default is an empty string. When set to a non-empty string, MIRA will create the tmp directory at the given location instead of using the current working directory.
This option is particularly useful for systems which have solid state disks (SSDs) or other very fast disk subsystems which can be used for temporary files. It is also useful in projects where the input and output files reside on an NFS mounted directory (current working dir), to put the tmp directory somewhere outside the NFS (see also: Things you should not do).
In both cases above, and for larger projects, MIRA then runs a lot faster.
<directoryname>
]
Default is gap4da. Defines the extension of the directory where mira will write the result of an assembly ready to import into the Staden package (GAP4) in Direct Assembly format. The name of the directory will then be <projectname>_.<extension>
<directoryname>
]
Default is .. Defines the directory where mira should search for experiment files (EXP).
<directoryname>
]
Default is .. Defines the directory where mira should search for SCF files.
The file options allow you to define your own input and output files.
string
]
Default is <projectname>_in.<seqtype>.fasta. Defines the fasta file to load sequences of a project from.
string
]
Default is <projectname>_in.<seqtype>.fasta.qual. Defines the file containing base qualities. The order of reads in the quality file does not need to be the same as in the fasta or fofn, although it saves a bit of time if it is.
string
]
Default is <projectname>_in.<seqtype>.fastq. Defines the fastq file to load sequences of a project from.
string
]
Default is <projectname>_in.<seqtype>.caf. Defines the file to load a CAF project from. Filename must end with '.caf'.
string
]
Default is <projectname>_in.fofn. Defines the file of filenames where the names of the EXP files of a project are located.
string
]
Default is <projectname>_in.fofn. Defines the file of filenames where the names of the PHD files of a project are located. Note: this is currently not available.
string
]
Default is <projectname>_in.phd. Defines the file which contains all the sequences of a project in PHD format.
string
]
Default is <projectname>_straindata_in.txt. Defines the file to load straindata from.
string
]
Default is <projectname>_xmltraceinfo_in.<seqtype>.xml. Defines the file to load a trace info file in XML format from. This can be used both when merging XML data to loaded files or when loading a project from an XML trace info file.
string
]
Default is <projectname>_ssaha2vectorscreen_in.txt. Defines the file to load the info about possible vector sequence stretches from.
string
]
Default is <projectname>_smaltvectorscreen_in.txt. Defines the file to load the info about possible vector sequence stretches from.
string
]
Default is <projectname>_in.<seqtype>.<filetype>. Defines the file to load a backbone from. Note that you still must define the file type with [-SB:bft].
Options for controlling which results to write to which type of files. Additionally, a few options allow output customisation of textual alignments (in text and HTML files).
There are 3 types of results: result, temporary results and extra temporary results. One probably needs only the results. Temporary and extra temporary results are written while building different stages of a contig and are given as convenience for trying to find out why mira set some RMBs or disassembled some contigs.
Output can be generated in these formats: CAF, Gap4 Directed Assembly, FASTA, ACE, TCS, WIG, HTML and simple text.
Naming conventions of the files follow the rules described in section Input / Output, subsection Filenames.
on|yes|1,off|no|0
]
Default is no. Controls whether 'unimportant' singlets are written to the result files.
Note: a value larger than 1 of the [-AS:mrpc] parameter will disable the function of this parameter.
on|yes|1,off|no|0
]
Default is yes. Controls whether singlets which have certain tags (see below) are written to the result files, even if [-OUT:sssip] (see above) is set.
If one of the (SRMr, CRMr, WRMr, SROr, SAOr, SIOr) tags appears in a singlet, MIRA will see that the singlet had been part of a larger alignment in earlier passes and was even part of a potentially 'important' decision. To give human finishers the possibility to trace back the decision, these singlets can be written to result files.
Note: a value larger than 1 of the [-AS:mrpc] parameter will disable the function of this parameter.
on|yes|1, off|no|0
]
Default is yes. Removes log and temporary files once they should not be needed anymore during the assembly process.
on|yes|1, off|no|0
]
Default is no. Removes the complete tmp directory at the end of the assembly process. Some logs and temporary files contain useful information that you may want to analyse though, therefore the default of MIRA is not to delete it.
on|yes|1, off|no|0
]
Default is yes.
on|yes|1, off|no|0
]
Default is yes.
on|yes|1, off|no|0
]
Default is yes for projects only with Sanger reads, 'no' as soon as there are 454, Solexa or SOLiD reads involved.
Note: MIRA will automatically switch to no (and cannot be forced to 'yes') when 454 or Solexa reads are present in the project, as this ensures that the file system does not get flooded with millions of files.
on|yes|1, off|no|0
]
Default is yes.
on|yes|1, off|no|0
]
Default is yes.
Note: the ACE output of MIRA conforms to the file specification given in the consed documentation. However, due to a bug in consed, consed cannot correctly load tags set by MIRA. The MIRA distribution comes with a small Tcl script, fixACE4consed.tcl, which implements a workaround to allow consed to load the ACE generated by MIRA. Use the script like this:
and then load the resulting outfile into consed.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is yes.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
integer > 0
]
Default is 60. When producing an output in text format ( [-OUT:ort|ott|oett]), this parameter defines how many bases each line of an alignment should contain.
integer > 0
]
Default is 60. When producing an output in HTML format, ( [-OUT:orh|oth|oeth]), this parameter defines how many bases each line of an alignment should contain.
<single character>
]
Default is (a blank). When producing an output in text format ( [-OUT:ort|ott|oett]), endgaps are filled up with this character.
<single character>
]
Default is (a blank). When producing an output in HTML format ( [-OUT:orh|oth|oeth]), end-gaps are filled up with this character.
Since version 3.0.0, mira puts all files and directories it generates into one sub-directory which is named projectname_assembly. This directory contains up to four sub-directories:
projectname_d_results: this directory contains all the output files of the assembly in different formats.
projectname_d_info: this directory contains information files of the final assembly. They provide statistics as well as, e.g., information (easily parseable by scripts) on which read is found in which contig etc.
projectname_d_tmp: this directory contains tmp files and temporary assembly files. It can be safely removed after an assembly as there may easily be a few GB of data in there that are normally not needed anymore. In case of problems: please do not delete. I will get in touch with you for additional information that might possibly be present in the tmp directory.
projectname_d_chkpt: this directory contains checkpoint files needed to resume assemblies that crashed or were stopped.
Note: the checkpointing functionality has not been completely implemented yet and currently cannot be used.
The input files must be placed (or linked to) in the directory from which mira is called.
projectname_in.fofn
File of filenames containing the names of the experiment or phd files to assemble when the [-LR:ft=FOFNEXP] option is used. One filename per line, blank lines accepted, lines starting with a hash (#) are treated as comment lines, nothing else. Use [-FN:fofnin] to change the default name.
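For scripts that need to check or generate such a file of filenames, the rules above (one name per line, blank lines and # comment lines ignored) can be sketched in Python. The function name is hypothetical and the sketch is not part of MIRA; stripping surrounding whitespace from each line is an assumption.

```python
def read_fofn(lines):
    """Return the filenames listed in a file of filenames.

    Hypothetical helper: accepts any iterable of lines (e.g. an open file),
    skips blank lines and lines starting with a hash.
    """
    names = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment line
        names.append(line)
    return names


example = ["# reads of plate 1\n", "read001.exp\n", "\n", "read002.exp\n"]
print(read_fofn(example))  # -> ['read001.exp', 'read002.exp']
```

With a real file, the same function can be used as `read_fofn(open("projectname_in.fofn"))`.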
projectname_in.phd
File containing the sequences (and their qualities) to assemble in PHD format.
projectname_in.fasta
File containing sequences and ...
projectname_in.fasta.qual
... file containing quality values of sequences for the assembly in FASTA format.
projectname_in.fastq
FASTQ file containing sequences and qualities. MIRA automatically recognises Sanger FASTQ format (base quality offset = 33) and newer Illumina FASTQ format (base quality offset = 64). Old Illumina FASTQ format with negative base qualities (base offset < 64) is not supported anymore.
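To check up front which quality offset a FASTQ file probably uses, a common heuristic is to look at the lowest quality character in the file. The Python sketch below shows that heuristic; it is an assumption for illustration and not a description of how MIRA performs its own recognition. A Sanger file containing only very high qualities would be misjudged by it.

```python
def guess_quality_offset(quality_lines):
    """Guess the FASTQ base quality offset (33 or 64) from quality strings.

    Hypothetical helper using a simple heuristic: any character below
    ASCII 64 can only occur with offset 33.
    """
    lowest = min(ord(ch) for line in quality_lines for ch in line.strip())
    if lowest < 33:
        raise ValueError("character below ASCII 33 is not a valid quality")
    return 33 if lowest < 64 else 64


print(guess_quality_offset(["IIII5+II"]))  # '+' is ASCII 43 -> 33
print(guess_quality_offset(["hhhhfehh"]))  # lowest is 'e' (ASCII 101) -> 64
```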
projectname_in.caf
File containing the sequences (and their qualities) to assemble in CAF format. This format also may contain the result of an assembly (the contig consensus sequences).
These result output files and sub-directories are placed in the projectname_results directory after a run of mira.
projectname_out.<type>
Assembled project written in type = (gap4da / caf / ace / fasta / html / tcs / wig / text) format by mira, final result.
Type gap4da is a directory containing experiment files and a file of filenames (called 'fofn'), all other types are files. gap4da, caf, ace contain the complete assembly information suitable for import into different post-processing tools (gap4, consed and others). html and text contain visual representations of the assembly suited for viewing in browsers or as simple text file. tcs is a summary of a contig suited for "quick" analyses from command-line tools or even visual inspection. wig is a file containing coverage information (useful for mapping assemblies) which can be loaded and shown by different genome browsers (IGB, GMOD, UCSC and probably many more).
fasta contains the contig consensus sequences (and .fasta.qual the consensus qualities). Please note that they come in two flavours: padded and unpadded. The padded versions may contain stars (*) denoting gap base positions where there was some minor evidence for additional bases, but not strong enough to be considered as a real base. Unpadded versions have these gaps removed. Padded versions have an additional postfix .padded, while unpadded versions do not have a special postfix.
These information files are placed in the projectname_info directory after a run of mira.
projectname_info_assembly.txt
This file contains basic information about the assembly. MIRA will split the information in two parts: information about large contigs and information about all contigs.
For more information on how to interpret this file, please consult the chapter on "Results" of the MIRA documentation manual.
Note: in contrast to other information files, this file always appears in the "info" directory, even when just intermediate results are reported.
projectname_info_contigreadlist.txt
This file contains information which reads have been assembled into which contigs (or singlets).
projectname_info_contigstats.txt
This file contains statistics about the contigs themselves, their length, average consensus quality, number of reads, maximum and average coverage, average read length, number of A, C, G, T, N, X and gaps in consensus.
projectname_info_consensustaglist.txt
This file contains information about the tags (and their position) that are present in the consensus of a contig.
projectname_info_readrepeats.lst
Tab delimited file with three columns: read name, repeat level tag, sequence.
This file permits a quick analysis of the repetitiveness of different parts of reads in a project. See [-SK:rliif] to control from which repetitive level on subsequences of reads are written to this file.
Note: reads can have more than one entry in this file. E.g., with standard settings (-SK:rliif=6), if the start of a read is covered by MNRr, followed by a HAF3 region, and finally the read ends with HAF6, then there will be two lines in the file: one for the subsequence covered by MNRr, one for HAF6.
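Because the file is a plain three-column, tab delimited table, it is easy to process with scripts. The Python sketch below groups the repeat level tags by read name; the function name is hypothetical, and the example lines use made-up read names and sequences.

```python
from collections import defaultdict


def tags_per_read(lines):
    """Map each read name to the list of its repeat level tags.

    Hypothetical helper: expects tab delimited lines with the three columns
    read name, repeat level tag, sequence; empty lines are skipped.
    """
    result = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name, tag, _sequence = line.split("\t")
        result[name].append(tag)
    return dict(result)


# Made-up example content: one read with two entries, one with a single entry.
example = ["read1\tMNRr\tACGTACGT\n", "read1\tHAF6\tGGGGCCCC\n", "read2\tHAF6\tTTTTAAAA\n"]
print(tags_per_read(example))  # -> {'read1': ['MNRr', 'HAF6'], 'read2': ['HAF6']}
```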
projectname_info_readstooshort
A list containing the names of those reads that have been sorted out of the assembly before any processing started only due to the fact that they were too short.
projectname_info_readtaglist.txt
This file contains information about the tags and their position that are present in each read. The read positions are given relative to the forward direction of the sequence (i.e. as it was entered into the assembly).
projectname_error_reads_invalid
A list of sequences that have been found to be invalid due to various reasons (given in the output of the assembler).
MIRA can write almost all of the following formats and can read most of them.
EXP
Standard experiment files used in genome sequencing. Correct EXP files are expected. Especially the ID record (containing the id of the reading) and the LN record (containing the name of the corresponding trace file) should be correctly set. See http://www.sourceforge.net/projects/staden/ for links to online format description.
SCF
The Staden trace file format that has established itself as compact standard replacement for the much bigger ABI files. See http://www.sourceforge.net/projects/staden/ for links to online format description.
The SCF files should be V2-8bit, V2-16bit, V3-8bit or V3-16bit and can be packed with compress or gzip.
CAF
Common Assembly Format (CAF) developed by the Sanger Centre. http://www.sanger.ac.uk/resources/software/caf.html provides a description of the format and some software documentation as well as the source for compiling caf2gap and gap2caf (thanks to Rob Davies for this).
ACE
The assembly file format used mainly by phrap and consed. Support for .ace output is currently only in test status in mira as documentation on that format is ... sparse and I currently don't have access to consed to verify my assumptions.
Using consed, you will need to load projects with -nophd to view them. Tags (in reads and consensus) are fully supported. The only hitch: consed has a bug which prevents it from reading consensus tags which are located throughout the whole file (as MIRA writes per default). The solution to that is easy: filter the ACE file through the fixACE4consed.tcl script which is provided in the MIRA distributions, then all should be well.
If you don't have consed, you might want to try clview (http://www.tigr.org/tdb/tgi/software/) from TIGR to look at .ace files.
MAF
MIRA Assembly Format (MAF). A faster and more compact form than EXP, CAF or ACE. See documentation in separate file.
HTML
Hypertext Markup Language. Projects written in HTML format can be viewed directly with any table capable browser. Display is even better if the browser knows style sheets (CSS).
FASTA
A simple format for sequence data, see http://www.ncbi.nlm.nih.gov/BLAST/fasta.html. An often used extension of that format is used to also store quality values in a similar fashion, these files have a .fasta.qual ending.
Mira writes two kinds of FASTA files for results: padded and unpadded. The difference is that the padded version still contains the gap (pad) character (an asterisk) at positions in the consensus where some of the reads apparently had some more bases than others but where the consensus routines decided to treat them as artifacts. The unpadded version has the gaps removed.
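The relation between both flavours is simple enough to show in a few lines of Python. MIRA writes both files itself, so the sketch is only meant for scripts that have just a padded sequence at hand; the function name is hypothetical, and dropping the quality values at pad positions is an assumption about how qualities should be handled.

```python
def unpad(sequence, qualities=None, pad="*"):
    """Remove pad characters from a padded consensus sequence.

    Hypothetical helper: if a list of quality values is given (one per
    padded position), the values at pad positions are dropped as well.
    """
    if qualities is None:
        return sequence.replace(pad, "")
    if len(qualities) != len(sequence):
        raise ValueError("need exactly one quality value per padded position")
    kept = [(b, q) for b, q in zip(sequence, qualities) if b != pad]
    return "".join(b for b, _ in kept), [q for _, q in kept]


print(unpad("AC*GT*A"))  # -> ACGTA
```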
PHD
This file type originates from the phred base caller and contains basically -- along with some other status information -- the base sequence, the base quality values and the peak indices, but not the sequence traces themselves.
GBF, GBK
GenBank file format as used at the NCBI to describe sequences. mira is able to read this format for using sequences as backbones in an assembly. Features of the GenBank format are also transferred automatically to Staden compatible tags.
traceinfo.XML
XML based file with information relating to traces. Used at the NCBI and ENSEMBL trace archive to store additional information (like clippings, insert sizes etc.) for projects. See further down for a description of the fields used and http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=rfc&m=main&s=rfc for a full description of all fields.
TCS
Transpose Contig Summary. A text file as written by mira which gives a summary of a contig in tabular fashion, one line per base. Nicely suited for "quick" analyses from command line tools, scripts, or even visual inspection in file viewers or spreadsheet programs.
In the current file version (TCS 1.0), each column is separated by at least one space from the next. Vertical bars are inserted as visual delimiter to help inspection by eye. The following columns are written into the file:
contig name (width 20)
padded position in contigs (width 3)
unpadded position in contigs (width 3)
separator (a vertical bar)
called consensus base
quality of called consensus base (0-100), but MIRA itself caps at 90.
separator (a vertical bar)
total coverage in number of reads. This number can be higher than the sum of the next five columns if Ns or IUPAC bases are present in the sequence of reads.
coverage of reads having an "A"
coverage of reads having a "C"
coverage of reads having a "G"
coverage of reads having a "T"
coverage of reads having an "*" (a gap)
separator (a vertical bar)
quality of "A" or "--" if none
quality of "C" or "--" if none
quality of "G" or "--" if none
quality of "T" or "--" if none
quality of "*" (gap) or "--" if none
separator (a vertical bar)
Status. This field sums up the evaluation of MIRA whether you should have a look at this base or not. The content can be one of the following:
everything OK: a colon (:)
unclear base calling (IUPAC base): a "!M"
potentially problematic base calling involving a gap or low quality: a "!m"
consensus tag(s) of MIRA that hint to problems: a "!$". Currently, the following tags will lead to this marker: SRMc, WRMc, DGPc, UNSc, IUPc.
list of consensus tags at that position, tags are delimited by a space. E.g.: "DGPc H454"
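Because the columns are separated by whitespace and vertical bars, a TCS line can be taken apart with very little code. The following Python sketch follows the column list above; the function name, the dictionary keys and the sample line are invented for illustration, and it assumes that the status and the tag list are the only things following the last bar.

```python
BASES = ("A", "C", "G", "T", "*")


def _quality(token):
    """Convert a quality column to int; the placeholder '--' becomes None."""
    return None if token == "--" else int(token)


def parse_tcs_line(line):
    """Split one TCS data line into a dictionary of named fields."""
    sections = [part.split() for part in line.split("|")]
    if len(sections) < 5:
        raise ValueError("expected at least four '|' separators")
    contig, padded_pos, unpadded_pos = sections[0]
    base, base_quality = sections[1]
    total, *per_base = (int(token) for token in sections[2])
    # Everything after the fourth bar: first the status, then any tag names.
    tail = [token for section in sections[4:] for token in section]
    return {
        "contig": contig,
        "padded_pos": int(padded_pos),
        "unpadded_pos": int(unpadded_pos),
        "base": base,
        "base_quality": int(base_quality),
        "total_coverage": total,
        "coverage": dict(zip(BASES, per_base)),
        "quality": dict(zip(BASES, map(_quality, sections[3]))),
        "status": tail[0],
        "tags": tail[1:],
    }


# Hypothetical line, written by hand to match the column description.
SAMPLE = "demo_c1 7 6 | * 35 | 10 0 0 0 4 6 | -- -- -- 30 35 | !$ DGPc H454"
record = parse_tcs_line(SAMPLE)
print(record["status"], record["tags"])
```

A typical use would be to print only those positions whose status is not a colon, i.e., the places MIRA suggests looking at.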
The current stage of the assembly is written to STDOUT, giving status messages on what mira is currently doing. Dumping to STDERR is almost not used anymore by MIRA; remnants will disappear over time.
Some debugging information might also be written to STDOUT if mira generates error messages.
On errors, MIRA will dump these also to STDOUT. Basically, three error classes exist:
WARNING: Messages in this error class do not stop the assembly but are meant as information to the user. In some rare cases these errors are due to (an always possible) error in the I/O routines of mira, but nowadays they are mostly due to unexpected (read: wrong) input data and can be traced back to errors in the preprocessing stages. If these errors arise, you definitely DO want to check how and why these errors came into those files in the first place.
Frequent causes for warnings include missing SCF files, SCF files containing known quirks, EXP files containing known quirks etc.
FATAL: Messages in this error class actually stop the assembly. These are mostly due to missing files that mira needs or to very garbled (wrong) input data.
Frequent causes include naming an experiment file in the 'file of filenames' that could not be found on the disk, same experiment file twice in the project, suspected errors in the EXP files, etc.
INTERNAL: These are true programming errors that were caught by internal checks. Should this happen, please mail the output of STDOUT and STDERR to the author.
MIRA extracts the following data from the TRACEINFO files:
trace_name (required)
trace_file (recommended)
trace_type_code (recommended)
trace_end (recommended)
clip_quality_left (recommended)
clip_quality_right (recommended)
clip_vector_left (recommended)
clip_vector_right (recommended)
strain (recommended)
template_id (recommended for paired end)
insert_size (recommended for paired end)
insert_stdev (recommended for paired end)
machine_type (optional)
program_id (optional)
Other data types are also read, but the info is not used.
Here's the example for a TRACEINFO file with ancillary info:
<?xml version="1.0"?>
<trace_volume>
  <trace>
    <trace_name>GCJAA15TF</trace_name>
    <program_id>PHRED (0.990722.G) AND TTUNER (1.1)</program_id>
    <template_id>GCJAA15</template_id>
    <trace_direction>FORWARD</trace_direction>
    <trace_end>F</trace_end>
    <clip_quality_left>3</clip_quality_left>
    <clip_quality_right>622</clip_quality_right>
    <clip_vector_left>1</clip_vector_left>
    <clip_vector_right>944</clip_vector_right>
    <insert_stdev>600</insert_stdev>
    <insert_size>2000</insert_size>
  </trace>
  <trace>
    ...
  </trace>
  ...
</trace_volume>
See http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=rfc&m=main&s=rfc for a full description of all fields and more info on the TRACEINFO XML format.
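To check which of the fields listed above are actually present in a TRACEINFO file before handing it to mira, a short Python script using the standard library XML parser is enough. This is a sketch for inspecting such files, not a description of how MIRA reads them; the function name is made up and the embedded XML is the example from above with the elided parts left out.

```python
import xml.etree.ElementTree as ET

# Fields which, according to the list above, MIRA extracts.
FIELDS = (
    "trace_name", "trace_file", "trace_type_code", "trace_end",
    "clip_quality_left", "clip_quality_right",
    "clip_vector_left", "clip_vector_right",
    "strain", "template_id", "insert_size", "insert_stdev",
    "machine_type", "program_id",
)

EXAMPLE = """<?xml version="1.0"?>
<trace_volume>
  <trace>
    <trace_name>GCJAA15TF</trace_name>
    <program_id>PHRED (0.990722.G) AND TTUNER (1.1)</program_id>
    <template_id>GCJAA15</template_id>
    <trace_direction>FORWARD</trace_direction>
    <trace_end>F</trace_end>
    <clip_quality_left>3</clip_quality_left>
    <clip_quality_right>622</clip_quality_right>
    <clip_vector_left>1</clip_vector_left>
    <clip_vector_right>944</clip_vector_right>
    <insert_stdev>600</insert_stdev>
    <insert_size>2000</insert_size>
  </trace>
</trace_volume>
"""


def read_traceinfo(xml_text):
    """Return one dict per <trace>, holding only the fields listed above."""
    root = ET.fromstring(xml_text)
    records = []
    for trace in root.iter("trace"):
        record = {}
        for field in FIELDS:
            element = trace.find(field)
            if element is not None and element.text is not None:
                record[field] = element.text.strip()
        if "trace_name" not in record:
            raise ValueError("trace entry without the required trace_name")
        records.append(record)
    return records


for record in read_traceinfo(EXAMPLE):
    missing = [field for field in FIELDS if field not in record]
    print(record["trace_name"], "lacks:", ", ".join(missing))
```

Values are kept as strings on purpose; converting clip positions and insert sizes to integers is left to the caller.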
MIRA names contigs the following way: <projectname>_<contigtype><number>. While <projectname> is dictated by the [--project=] parameter and <number> should be clear, the <contigtype> might need additional explaining. There are currently three contig types existing:
_c: these are "normal" contigs
_rep_c: these are contigs containing only repetitive areas. These contigs had _lrc as type in previous versions of MIRA; this was changed to _rep_c to make things clearer.
_s: these are singlet-contigs. Technically: "contigs" with a single read.
Basically, for genome assemblies MIRA starts to build contigs in areas which seem "rock solid", i.e., not a repetitive region (main decision point) and nice coverage of good reads. Contigs which started like this get a _c name. If during the assembly MIRA reaches a point where it cannot start building a contig in a non-repetitive region, it will name the contig _rep_c instead of _c.
Note: Although the distinction between _c and _rep_c makes sense only for genome assemblies, EST assemblies also use it (for no better reason than me not having an alternative or better naming scheme there).
Note: Depending on the settings of [-AS:mrpc], your project may or may not contain _s singlet-contigs. Also note that reads landing in the debris file will not get assigned to singlet-contigs and hence not get _s names.
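When post-processing results with scripts, the contig type can be recovered from the name. The Python sketch below assumes that names look like ecoli_c1, ecoli_rep_c12 or ecoli_s7, i.e., project name, type and number joined as described above; the function name and the example names are hypothetical.

```python
import re

# Non-greedy project part so that "_rep_c" is recognised before "_c".
_CONTIG_NAME = re.compile(r"(?P<project>.+?)_(?P<type>rep_c|c|s)(?P<number>\d+)")

DESCRIPTIONS = {
    "c": "normal contig",
    "rep_c": "contig containing only repetitive areas",
    "s": "singlet-contig (a single read)",
}


def split_contig_name(name):
    """Return (project, contig type, number) for a MIRA-style contig name."""
    match = _CONTIG_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a recognised contig name: {name!r}")
    return match["project"], match["type"], int(match["number"])


for name in ("ecoli_c1", "ecoli_rep_c12", "my_proj_s7"):
    project, contig_type, number = split_contig_name(name)
    print(name, "->", project, number, DESCRIPTIONS[contig_type])
```

Note that a project whose own name ends in _rep cannot be told apart from a _rep_c contig by the name alone; the sketch then decides in favour of _rep_c.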
In case you used strain information in an assembly, you can recover the consensus for just any given strain by using convert_project and convert from a full assembly format (e.g. MAF or CAF) which also carries strain information to FASTA. MIRA will automatically detect the strain information and create one FASTA file per strain encountered.
Note: To be able to distinguish between consensus bases with an 'N' call and areas of a strain which were not covered at all by any read of that strain, MIRA introduces the '@' sign as additional "base". That is, if you see a '@' in the consensus of a given strain, this may be either due to too low coverage -- and therefore a hole -- or to a genuine deletion in your strain.
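To get a quick list of the stretches of a strain consensus which are marked with '@', something like the following Python sketch can be used. The function name and the sample sequence are invented; reading the per-strain FASTA file is left out.

```python
import re


def uncovered_stretches(strain_consensus):
    """Return (start, end) pairs of runs of '@'.

    Positions are 0-based and the end is exclusive.
    """
    return [match.span() for match in re.finditer(r"@+", strain_consensus)]


sequence = "ACGT@@@ACN@T"  # hypothetical consensus of one strain
for start, end in uncovered_stretches(sequence):
    print(f"no coverage from {start} to {end - 1} ({end - start} bases)")
```

Whether a given stretch is a coverage hole or a genuine deletion cannot be decided from the sequence alone and still needs a look at the assembly.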
MIRA uses and sets a couple of tags during the assembly process. That is, if information is known before the assembly, it can be stored in tags (in the EXP and CAF formats) and will be used in the assembly.
This section lists "foreign" tags, i.e., tags whose definition was made by software packages other than MIRA.
ALUS, REPT: Sequence stretches tagged as ALUS (ALU Sequence) or REPT (general repetitive sequence) will be handled with extreme care during the assembly process. The allowed error rate after automatic contig editing within these stretches is normally far below the general allowed error rate, leading to much higher stringency during the assembly process and subsequently to a better repeat resolving in many cases.
FpAS: GenBank feature for a poly-A signal. Used in EST, cDNA or transcript assembly. Either read in the input files or set when using [-CL:cpat]. This allows keeping the poly-A signal in the reads during assembly without it interfering as a massive repeat or as mismatches.
FCDS, Fgen: GenBank features as described in GBF/GBK files or set in the Staden package are used to make some SNP impact analysis on genes.
other: all other tags in reads will be read and passed through the assembly without being changed; they currently do not influence the assembly process.
This section lists tags which MIRA sets (and reads of course), but that other software packages might not know about.
UNSr, UNSc: UNSure in Read respectively Contig. These tags denote positions in an assembly with conflicts that could not be resolved automatically by mira. These positions should be looked at during the finishing process.
For assemblies using good sequences and enough coverage, something like 0.01% of the consensus positions have such a tag (e.g. ~300 UNSc tags for a genome of 3 megabases).
SRMr, WRMr: Strong Repeat Marker and Weak Repeat Marker. These tags are set in two flavours: as SRMr and WRMr when set in reads, and as SRMc and WRMc when set in the consensus. These tags are used on an individual per base basis for each read. They denote bases that have been identified as crucial for resolving repeats, often denoting a single SNP within several hundreds or thousands of bases. While a SRM is quite certain, the WRM really is either weak (there wasn't enough comforting information in the vicinity to be really sure) or involves gap columns (which is always a bit tricky).
mira will automatically set these tags when it encounters repeats and will tag exactly those bases that can be used to discern the differences.
Seeing such a tag in the consensus means that mira was not able to finish the disentanglement of that special repeat stretch or that it found a new one in one of the last passes without having the opportunity to resolve the problem.
DGPc: Dubious Gap Position in Consensus. Set whenever the gap to base ratio in a column of 454 reads is between 40% and 60%.
SAO, SRO, SIO: SNP intrA Organism, SNP inteR Organism, SNP Intra and inter Organism. As for SRM and WRM, these tags have an r appended when set in reads and a c appended when set in the consensus. These tags denote SNP positions.
mira will automatically set these tags when it encounters SNPs and will tag exactly those bases that can be used to discern the differences. They denote SNPs as they occur within an organism (SAO), between two or more organisms (SRO) or within and between organisms (SIO).
Seeing such a tag in the consensus means that mira set this as a valid SNP in the assembly pass. Seeing such tags only in reads (but not in the consensus) shows that in a previous pass, mira thought these bases to be SNPs but that in later passes, this SNP does not appear anymore (perhaps due to resolved misassemblies).
STMS: (only hybrid assemblies). The Sequencing Type Mismatch Solved tag is set at positions in the assembly where the consensus of reads from different sequencing technologies (Sanger, 454, Ion Torrent, Solexa, PacBio, SOLiD) differs, but mira thinks it found the correct solution. Often this is due to low coverage of one of the types and an additional base calling error.
Sometimes this depicts real differences, where possible explanations might include: slightly different bugs were sequenced or a mutation occurred during library preparation.
STMU: (only hybrid assemblies). The Sequencing Type Mismatch Unresolved tag is set at positions in the assembly where the consensus of reads from different sequencing technologies (Sanger, 454, Ion Torrent, Solexa, SOLiD) differs, but mira could not find a good resolution. Often this is due to low coverage of one of the types and an additional base calling error.
Sometimes this depicts real differences, where possible explanations might include: slightly different bugs were sequenced or a mutation occurred during library preparation.
MCVc: The Missing CoVerage in Consensus. Set in assemblies with more than one strain. If a strain has no coverage at a certain position, the consensus gets tagged with this tag (and the name of the strain which misses this position is put in the comment). Additionally, the sequence in the result files for this strain will have an @ character.
MNRr: (only with [-SK:mnr] active). The Masked Nasty Repeat tags are set over those parts of a read that have been detected as being many more times present than the average sub-sequence. mira will hide these parts during the initial all-against-all overlap finding routine (SKIM3) but will otherwise happily use these sequences for consensus generation during contig building.
FpAS: See "Tags read (and used)" above.
ED_C, ED_I, ED_D: EDit Change, EDit Insertion, EDit Deletion. These tags are set by the integrated automatic editor EdIt and show which edit actions have been performed.
HAF2, HAF3, HAF4, HAF5, HAF6, HAF7. These are HAsh Frequency tags which show the status of read parts in comparison to the whole project. Only set if [-AS:ard] is active (default for genome assemblies).
More info on how to use the information conveyed by HAF tags can be found in the section dealing with repeats and HAF tags in finishing programs further down in this manual.
HAF2 coverage below average ( standard setting at < 0.5 times average)
HAF3 coverage is at average ( standard setting at ≥ 0.5 times average and ≤ 1.5 times average)
HAF4 coverage above average ( standard setting at > 1.5 times average and < 2 times average)
HAF5 probably repeat ( standard setting at ≥ 2 times average and < 5 times average)
HAF6 'heavy' repeat ( standard setting at > 8 times average)
HAF7 'crazy' repeat ( standard setting at > 20 times average)
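The standard thresholds listed above translate into a simple lookup. The Python sketch below is only an illustration of that list and not MIRA code; the function name is made up. Note that the ranges as listed leave frequencies from 5 to 8 times the average unassigned, so the sketch returns None there rather than guessing.

```python
def haf_level(frequency, average):
    """Return the HAF tag name for a k-mer frequency, per the listed ranges.

    Returns None for 5 to 8 times the average, which the list above
    does not assign to any tag.
    """
    if average <= 0:
        raise ValueError("average frequency must be positive")
    ratio = frequency / average
    if ratio > 20:
        return "HAF7"   # 'crazy' repeat
    if ratio > 8:
        return "HAF6"   # 'heavy' repeat
    if ratio >= 5:
        return None     # not covered by the listed standard settings
    if ratio >= 2:
        return "HAF5"   # probably repeat
    if ratio > 1.5:
        return "HAF4"   # above average
    if ratio >= 0.5:
        return "HAF3"   # average
    return "HAF2"       # below average


for frequency in (4, 10, 17, 30, 60, 100, 250):
    print(frequency, haf_level(frequency, average=10))
```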
At the start, things are simple: a read either aligns with other reads or it does not. Reads which align with other reads form contigs, and these MIRA will save in the results with a contig name of _c.
However, not all reads can be placed in an assembly. This can have several reasons and these reads may end up at two different places in the result files: either in the debris file, then just as a name entry, or as singlet (a "contig" with just one read) in the regular results.
reads are too short and get filtered out (before or after the MIRA clipping stages). These invariably land in the debris file.
reads are real singlets: they contain genuine sequence but have no overlap with any other read. These get either caught by the [-CL:pec] clipping filter or during the SKIM phase
reads contain mostly or completely junk.
reads contain chimeric sequence (therefore: they're also junk)
MIRA filters out these reads in different stages: before and after read clipping, during the SKIM stage, during the Smith-Waterman overlap checking stage or during contig building.
The exact place where these single reads land depends on why they do not align with other reads.
MIRA is able to find and tag SNPs in any kind of data -- be it genomic or EST -- in both de-novo and mapping assemblies ... provided it knows which read in an assembly is coming from which strain, cell line or organism.
The SNP detection routines are based on the same routines as the routines for detecting non-perfect repeats. In fact, MIRA can even distinguish between bases marking a misassembled repeat from bases marking a SNP within the same project.
All you need to do to enable this feature is to set [-CO:mr=yes] (which is standard in all --job=... incantations of mira and in some steps of miraSearchESTSNPs). Furthermore, you will need:
to provide a straindata file for the reads or have the strain information in ancillary NCBI TRACEINFO XML files.
to provide a straindata file for the reads and also give the reference sequence(s) (backbone(s)) a strain name via the [-SB:bsn] parameter.
The effect of using strain names attached to reads can be described briefly like this. Assume that you have 6 reads (called R1 to R6), three of them having an A at a given position, the other three a C.
R1 ......A......
R2 ......A......
R3 ......A......
R4 ......C......
R5 ......C......
R6 ......C......
Note: This example is just that: an example. It uses just 6 reads, with two times three reads as read groups for demonstration purposes and without looking at qualities. For MIRA to recognise SNPs, a few things must come together (e.g. for many sequencing technologies it wants forward and backward reads when in de-novo assembly) and a couple of parameters can be set to adjust the sensitivity. Read more about the parameters: [-CO:mrpg:mnq:mgqrt:emea:amgb:amgbemc:amgbnbs]
Now, assume you did not give any strain information. MIRA will most probably recognise a problem and, having no strain information, assume it made an error by assembling two different repeats of the same organism. It will tag the bases in the reads with repeat marker tags (SRMr) and the base in the consensus with a SRMc tag (to point at an unresolved problem). In a subsequent pass, MIRA will then not assemble these six reads together again, but create two contigs like this:
Contig1:
R1 ......A......
R2 ......A......
R3 ......A......
Contig2:
R4 ......C......
R5 ......C......
R6 ......C......
The bases in the reads will keep their SRMr tags, but the consensus base of each contig will not get a SRMc tag as there is no conflict anymore.
Now, assume you gave reads R1, R2 and R3 the strain information "human", and read R4, R5 and R6 "chimpanzee". MIRA will then create this:
R1 (hum) ......A......
R2 (hum) ......A......
R3 (hum) ......A......
R4 (chi) ......C......
R5 (chi) ......C......
R6 (chi) ......C......
Instead of creating two contigs, it will create again one contig ... but it will tag the bases in the reads with a SROr tag and the position in the contig with a SROc tag. The SRO tags (SNP inteR Organisms) tell you: there's a SNP between those two (or multiple) strains/organisms/whatever.
Changing the above example a little, assume you have this assembly early on during the MIRA process:
R1 (hum) ......A......
R2 (hum) ......A......
R3 (hum) ......A......
R4 (chi) ......A......
R5 (chi) ......A......
R6 (chi) ......A......
R7 (chi) ......C......
R8 (chi) ......C......
R9 (chi) ......C......
Because "chimp" has a SNP within itself (A versus C) and there's a SNP between "human" and "chimp" (also A versus C), MIRA will see a problem and set a tag, this time a SIOr tag: SNP Intra- and inter Organism.
MIRA does not like conflicts occurring within an organism and will try to resolve these cleanly. After setting the SIOr tags, MIRA will re-assemble in subsequent passes this:
Contig1:
R1 (hum) ......A......
R2 (hum) ......A......
R3 (hum) ......A......
R4 (chi) ......A......
R5 (chi) ......A......
R6 (chi) ......A......
Contig2:
R7 (chi) ......C......
R8 (chi) ......C......
R9 (chi) ......C......
The reads in Contig1 (hum+chi) and Contig2 (chi) will keep their SIOr tags, the consensus will have no SIOc tag as the "problem" was resolved.
When presented with conflicting information regarding SNPs and possible repeat markers or SNPs within an organism, MIRA will always first try to resolve the repeat markers. Assume the following situation:
R1 (hum) ......A...T......
R2 (hum) ......A...G......
R3 (hum) ......A...T......
R4 (chi) ......C...G......
R5 (chi) ......C...T......
R6 (chi) ......C...G......
While the first discrepancy column can be "explained away" by a SNP between organisms (it will get a SROr/SROc tag), the second column cannot and will get a SIOr/SIOc tag. After that, MIRA opts to get the SIO conflict resolved:
Contig1:
R1 (hum) ......A...T......
R3 (hum) ......A...T......
R5 (chi) ......C...T......
Contig2:
R2 (hum) ......A...G......
R4 (chi) ......C...G......
R6 (chi) ......C...G......
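The first decision made in the examples above, namely whether a discrepancy column can be "explained away" by strain differences or whether at least one strain disagrees with itself, can be sketched as a toy Python function. This is a strong simplification for illustration only: it ignores qualities, read directions and all the thresholds MIRA applies, and the function name and return strings are made up.

```python
from collections import defaultdict


def classify_column(reads):
    """Classify one alignment column given (strain, base) pairs.

    Reads without strain information can all be given the strain None.
    """
    bases_by_strain = defaultdict(set)
    all_bases = set()
    for strain, base in reads:
        bases_by_strain[strain].add(base)
        all_bases.add(base)
    if len(all_bases) <= 1:
        return "no discrepancy"
    if all(len(bases) == 1 for bases in bases_by_strain.values()):
        # Every strain is uniform in itself: a SNP between strains (SRO).
        return "explained by strains"
    # At least one strain shows two bases: treated as a conflict to resolve.
    return "conflict within a strain"


human_chimp = [("hum", "A")] * 3 + [("chi", "C")] * 3
no_strains = [(None, "A")] * 3 + [(None, "C")] * 3
second_column = [("hum", "T"), ("hum", "G"), ("hum", "T"),
                 ("chi", "G"), ("chi", "T"), ("chi", "G")]

print(classify_column(human_chimp))
print(classify_column(no_strains))
print(classify_column(second_column))
```

Columns of the second kind are the ones which lead MIRA to split reads into separate contigs in later passes, as shown in the examples.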
The default parameters for MIRA assemblies work best when given real sequencing data and they even expect the data to behave like real sequencing data. But some assembly strategies work in multiple rounds, using so called "artificial" or "synthetic" reads in later rounds, i.e., data which was not generated through sequencing machines but might be something like the consensus of previous assemblies.
If one doesn't take utmost care to make these artificial reads at least behave a little bit like real sequencing data, a number of quality assurance algorithms of MIRA might spot that they "look funny" and trim back these artificial reads ... sometimes even removing them completely. The following list gives a short overview on what these synthetic reads should look like or which MIRA algorithms to switch off in certain cases:
Forward and reverse complement directions: most sequencing technologies and strategies yield a mixture of reads with both forward and reverse complement direction to the DNA sequenced. In fact, having both directions allows for a much better quality control of an alignment as sequencing technology dependent sequencing errors will often affect only one direction at a given place and not both (the exception being homopolymers and 454).
The MIRA proposed end clipping algorithm [-CL:pec] uses this knowledge to initially trim back ends of reads to an area without sequencing errors. However, if reads covering a given area of DNA are present in only one direction, then these reads will be completely eliminated.
If you use only artificial reads in an assembly, then switch off the proposed end clipping [-CL:pec=no].
If you mix artificial reads with "normal" reads, make sure that every part of an artificial read is covered by some other read in reverse complement direction (be it a normal or artificial read). The easiest way to do that is to add a reverse complement for every artificial read yourself, though if you use an overlapping strategy with artificial reads, you can calculate the overlaps and reverse complements of reads so that every second artificial read is in reverse complement to save time and memory afterwards during the computation.
Sequencing type/technology: MIRA currently knows Sanger, 454, Ion Torrent, Solexa and PacBio as sequencing technologies; every read entered in an assembly must be one of those.
Artificial reads should be classified depending on the data they were created from, that is, Sanger for consensus of Sanger reads, 454 for consensus of 454 reads etc. However, should reads created from Illumina consensus be much longer than, say, 200 or 300 bases, you should treat them as Sanger reads.
Quality values: be careful to assign decent quality values to your artificial reads as several quality clipping or consensus calling algorithms make extensive use of qualities. Pay attention to values of [-CL:qc:bsqc] as well as to [-CO:mrpg:mnq:mgqrt].
Read lengths: the current maximum read length for MIRA is around 30kb. However, to allow for a safety margin, MIRA currently accepts only reads of up to 20kb.
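The advice on forward and reverse complement directions given in the list above, adding a reverse complement for every artificial read, can be sketched in Python as follows. The function names and the "_rc" name suffix are hypothetical choices for this illustration; only A, C, G, T and N are complemented, other characters (like IUPAC codes) are left unchanged.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def with_reverse_complements(reads, suffix="_rc"):
    """Add a reverse-complemented copy of every read.

    reads maps a read name to a (sequence, list of qualities) pair.
    The quality list of each copy is reversed along with its sequence.
    """
    result = {}
    for name, (sequence, qualities) in reads.items():
        result[name] = (sequence, qualities)
        result[name + suffix] = (reverse_complement(sequence), qualities[::-1])
    return result


artificial = {"art1": ("AACG", [30, 30, 20, 10])}  # hypothetical read
for name, (sequence, qualities) in with_reverse_complements(artificial).items():
    print(name, sequence, qualities)
```

Writing the resulting reads to FASTA and quality files is left out here.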
MIRA treats ploidy differences as repeats and will therefore build separate contigs for the reads of a ploidy that has a difference to the other ploidy/ploidies.
There is simply no other way to handle ploidy while retaining the ability to separate repeats based on differences of only a single base. Everything else would be guesswork. I thought for some time about doing a coverage analysis around the potential repeat/ploidy site, but came to the conclusion that due to the stochastic nature of sequencing data, this would very probably take wrong decisions in too many cases to be acceptable.
If someone has a good idea, I'll be happy to hear it.
Under the assumption that reads in a project are uniformly distributed across the genome, MIRA will enforce an average coverage and temporarily reject reads from a contig when this average coverage multiplied by a safety factor is reached at a given site. This strategy reduces overcompression of repeats during the contig building phase and keeps reads in reserve for other copies of that repeat.
It's generally a very useful tool to disentangle repeats, but it has some slight secondary effects: rejection of otherwise perfectly good reads. The assumption of read distribution uniformity is the big problem we have here: of course it's not really valid. You sometimes have less, and sometimes more than "the average" coverage. Furthermore, the new sequencing technologies - 454 perhaps but certainly the ones from Solexa - show that you also have a skew towards the site of replication origin.
Warning: Solexa data from late 2009 and 2010 show a high GC content bias. This bias can reach 200 or 300%, i.e., sequence part for with low GC
One example: let's assume the average coverage of a project is 8 and by chance at one place there are 17 (non-repetitive) reads, then the following happens:
(Note: $p$ is the parameter [-AS:urdsip])
Pass 1 to $p-1$: MIRA happily assembles everything together and calculates a number of different things, amongst them an average coverage of ~8. At the end of pass $p-1$, it will announce this average coverage as first estimate to the assembly process.
Pass $p$: MIRA has still assembled everything together, but at the end of each pass the contig self-checking algorithms now include an "average coverage check". They'll invariably find the 17 reads stacked and decide (looking at the [-AS:ardct] parameter which is assumed to be 2 for this example) that 17 is larger than 2*8 and that this very well may be a repeat. The reads get flagged as possible repeats.
Pass $p+1$ to end: the "possibly repetitive" reads get a much tougher treatment in MIRA. Amongst other things, when building the contig, the contig now checks that "possibly repetitive" reads do not overstack by an average coverage multiplied by a safety value ([-AS:urdcm]) which we'll assume now to be 1.5 in this example. So, at a certain point, say when read 14 or 15 of that possible repeat wants to be aligned to the contig at this given place, the contig will just flatly refuse and tell the assembler to please find another place for them, be it in this contig that is built or any other that will follow. Of course, if the assembler cannot comply, the reads 14 to 17 will end up as a contiglet (contig debris, if you want) or, if it was only one read that got rejected like this, it will end up as a singlet or in the debris file.
Tough luck. I do have ideas on how to reintegrate those reads at the end of an assembly, but I have deferred doing this as in every case I looked at, adding those reads to the contigs wouldn't have changed anything ... there's already enough coverage.
What should be done in those cases is simply to filter away the contiglets (defined as being of small size and having an average coverage below the average coverage of the project divided by 3 (or 2.5)) from a project.
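The numbers of the example above can be recomputed with a few lines of Python. This sketch reproduces only the threshold arithmetic; how MIRA measures coverage at a site while building a contig is not modelled, so the text's "read 14 or 15" should be taken as illustrative. The function names are made up, and the parameter names merely echo [-AS:ardct] and [-AS:urdcm].

```python
def flag_as_possible_repeat(coverage_at_site, average_coverage, ardct=2.0):
    """True if the coverage exceeds the average times the [-AS:ardct] factor."""
    return coverage_at_site > ardct * average_coverage


def coverage_cap(average_coverage, urdcm=1.5):
    """Nominal stacking limit: average times the [-AS:urdcm] safety value."""
    return urdcm * average_coverage


def contiglet_coverage_limit(average_coverage, divisor=3.0):
    """Coverage below which small contigs are considered contiglets."""
    return average_coverage / divisor


average = 8
print(flag_as_possible_repeat(17, average))  # 17 compared with 2 * 8
print(coverage_cap(average))                 # 1.5 * 8
print(contiglet_coverage_limit(average))     # 8 / 3
```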
Since version 2.9.36, MIRA has had a feature to keep long repeats in separate contigs ([-AS:klrs]). Due to algorithm changes, this feature is now standard (even if the command line parameter is still present). The effect of this is that contigs with non-repetitive sequence will stop at a 'long repeat' border, including only the first few bases of the repeat. Long repeats will be kept as separate contigs.
This has been implemented to get a clean overview on which parts of an assembly are 'safe' and which parts will be 'difficult'. For this, the naming of the contigs has been extended: contigs named with a '_c' at the end are contigs which contain mostly 'normal' coverage. Contigs with '_rep_c' are contigs which contain mostly sequence classified as repetitive and which could not be assembled together with a '_c' contig.
The question remains: what are 'long' repeats? MIRA defines these as repeats that are not spanned by any read that has non-repetitive parts at the end. So, basically, the mean length of the reads that go into the assembly defines the length of 'long' repeats that have to be kept in separate contigs.
It has to be noted that when using paired-end (or template) sequencing, 'long' repeats which can be spanned by read-pairs (or templates) are mostly integrated into 'normal' contigs as MIRA can correctly place them most of the time.
HAF tags (HAsh Frequency) are set by MIRA when the option to colour reads by hash frequency ([-GE:crhf], on by default in most --job combinations) is on. These tags show the status of k-mers (stretch of bases of given length $k$) in read sequences: whether MIRA recognised them as being present in sub-average, average, above average or repetitive numbers.
When using a finishing program which can display tags in reads (and using the proposed tag colour schemes for gap4 or consed), the assembly will light up in colours ranging from light green to dark red, indicating whether a certain part of the assembly is deemed non-repetitive to extremely repetitive.
One of the biggest advantages of the HAF tags is the implicit information they convey on why the assembler stopped building a contig at an end.
if the read parts composing a contig end are mostly covered with HAF2 tags (below average frequency, coloured light-green), then one very probably has a hole in the contig due to coverage problems which means there are no or not enough reads covering a part of the sequence.
if the read parts composing a contig end are mostly covered with HAF3 tags (average frequency, coloured green), then you have an unusual situation as this should only very rarely occur. The reason is that MIRA saw that there are enough sequences which look the same as the one from your contig end, but that these could not be joined. Likely reasons for this scenario include non-random sequencing artifacts (seen in 454 data) or also non-random chimeric reads (seen in Sanger and 454 data).
if the read parts composing a contig end are mostly covered with HAF4 tags (above average frequency, coloured yellow), then the assembler stopped at a grey zone of the coverage not being normal anymore, but not quite repetitive yet. This can happen in cases where the read coverage is very unevenly distributed across the project. The contig end in question might be a repeat occurring two times in the sequence, but having fewer reads than expected. Or it may be non-repetitive coverage with an unusual excess of reads.
if the read parts composing a contig end are mostly covered with HAF5 (repeat, coloured red), HAF6 (heavy repeat, coloured darker red) and HAF7 tags (crazy repeat, coloured very dark red), then there is a repetitive area in the sequence which could not be uniquely bridged by the reads present in the assembly.
This information can be especially helpful when joining reads by hand in a finishing program. The following list gives you a short guide to cases which are most likely to occur and what you should do.
the proposed join involves contig ends mostly covered by HAF2 tags. Joining these contigs is probably a safe bet. The assembly may have missed this join because of too many errors in the read ends or because sequence which could be useful to join the contigs was clipped away. Just check whether the join seems sensible, then join.
the proposed join involves contig ends mostly covered by HAF3 tags. Joining these contigs is probably a safe bet. The assembly may have missed this join because of several similar chimeric reads or reads with similar, severe sequencing errors covering the same spot. Just check whether the join seems sensible, then join.
the proposed join involves contig ends mostly covered by HAF4 tags. Joining these contigs should be done with some caution, it may be a repeat occurring twice in the sequence. Check whether the contig ends in question align with ends of other contigs. If not, joining is probably the way to go. If potential joins exist with other contigs, then it's a repeat (see below).
the proposed join involves contig ends mostly covered by HAF5, HAF6 or HAF7 tags. Joining these contigs should be done with utmost caution, you are almost certainly (HAF5) and very certainly (HAF6 and HAF7) in a repetitive area of your sequence. You will probably need additional information like paired-end or template info in order to join your contigs.
MIRA goes a long way to calculate a consensus which is as correct as possible. Unfortunately, communication with finishing programs is a bit problematic as there currently is no standard way to say which reads are from which sequencing technology.
It is therefore often the case that finishing programs calculate their own consensus when loading a project assembled with MIRA. This is the case for gap4, at least. This consensus may then not be optimal.
The recommended way to deal with this problem is: import the results from MIRA into your finishing program like you always do. Then finish the genome there, export the project from the finishing program as CAF and finally use convert_project (from the MIRA package) with the "-r" option to recalculate the optimal consensus of your finished project.
E.g., assuming you have just finished editing the gap4 database DEMO.3, do the following. First, export the gap4 database back to CAF:
$ gap2caf -project DEMO -version 3 >demo3.caf
Then, use convert_project with option '-r' to convert it into any other format that you need. Example for converting to a CAF and a FASTA format with correct consensus:
$ convert_project -f caf -t caf -t fasta -r c demo3.caf final_result
mira cannot work with EXP files resulting from GAP4 that already have been edited. If you want to reassemble an edited GAP4 project, convert it to CAF format and use the [-caf] option to load.
As also explained earlier, mira relies on sequencing vector being recognised in preprocessing steps by other programs. Sometimes, when a whole stretch of bases is not correctly marked as sequencing vector, the reads might not be aligned into a contig although they might otherwise match quite perfectly. You can use [-CL:pvc] and [-CO:emea] to address problems with incomplete clipping of sequencing vectors. Having the assembler work with less strict parameters may also help here.
mira has been developed to assemble shotgun sequencing or EST sequencing data. There are no explicit limitations concerning length or number of sequences. However, there are a few implicit assumptions that were made while writing portions of the code:
Sequence data produced by electrophoresis rarely surpasses 1000 usable bases and I never heard of, let alone seen, more than 1100. The fast filtering SKIM relies on the fact that sequences will never exceed 10000 bases in length.
The next problem that might arise with 'unnatural' long sequence reads will be my implementation of the Smith-Waterman alignment routines. I use a banded version with linear running time (linear to the bandwidth) but quadratic space usage. So, comparing two 'reads' of length 5000 will result in memory usage of 100MB. I know that this could be considered as a flaw. On the other hand - unless someone comes up with electrophoresis producing reads with more than 2000 usable bases - I see no real need to change this as long as there are more important things on the TODO list. Of course, if anyone is willing to contribute a fast banded SW alignment routine which runs in linear time and space, just feel free to contact the author.
Current data structures allow for a worst case read coverage of at most 16384 reads on top of each other.
Note: this limit was more than enough for about any kind of genome sequencing, but since people started to do sequencing of non-normalised EST libraries with 454 and Solexa, this limit can be reached all too often. This will change in future releases.
the 32-bit Linux version is limited by the memory made available by the Linux kernel (somewhere around 2.3 to 2.7GB).
the 64-bit Linux version has no implicit memory limits, although the maximum number of bases of all reads may not surpass 2,147,483,648 bases. With that, even aliens with a genome size ~800 times bigger than humans could be tackled (if it were not for other limitations, mainly RAM and processing power).
mira is not fully multi-threaded (yet), but even Sanger projects for bigger bacteria can be assembled in ~2-3 hours on a current hardware platform. Fungi may take two or three days.
For 454 genome projects, bacteria should be done in about a day at most, fungi could take about 10 days.
to reduce memory overhead, the following assumptions have been made:
a project does not contain sequences from more than 255 different:
sequencing machine types
primers
strains (in mapping mode: 7)
base callers
dyes
process status
a project does not contain sequences from more than 65535 different
clone vectors
sequencing vectors
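The 100MB figure quoted above for comparing two 'reads' of length 5000 with the Smith-Waterman routines can be reproduced with a quick calculation. The sketch below assumes that the quadratic space usage means one matrix cell per pair of bases and that each cell takes 4 bytes. The cell size is an assumption chosen because it matches the quoted number; it is not taken from the MIRA source code.

```python
def full_matrix_bytes(len_a, len_b, bytes_per_cell=4):
    """Memory for a full alignment matrix (quadratic in the read lengths).

    bytes_per_cell=4 is an assumption, not a value taken from MIRA.
    """
    return len_a * len_b * bytes_per_cell


if __name__ == "__main__":
    for length in (1000, 2000, 5000, 10000):
        megabytes = full_matrix_bytes(length, length) / 1_000_000
        print(f"{length:>6} x {length:<6} -> {megabytes:6.0f} MB")
```

Under these assumptions, reads of typical electrophoresis length (around 1000 bases) need only a few megabytes per comparison, which is why the quadratic space usage rarely matters in practice.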
Note: Versions with odd minor version numbers (e.g. 1.1.x, 1.3.x, ..., 2.1.x, etc.) are development versions which might be unstable in parts (although I don't think so). But to catch possible bugs, development versions of mira are distributed with tons of internal checks compiled into the code, making it somewhere between 10% and 50% slower than it could be.
Of course one can run MIRA atop an NFS mount (a "disk" mounted over a network using the NFS protocol), but the performance will go down the drain as the NFS server or the network will not be able to cope with the amount of data MIRA needs to shift to and from disk (writes/reads to the tmp directory). Slowdowns of a factor of 10 and more have been observed. In case you have no other possibility, you can force MIRA to run atop NFS using [-MI:sonfs=no], but you have been warned.
In case you want to keep input and output files on NFS, you can use [-DI:trt] to redirect the tmp directory to a local filesystem. Then MIRA will run at almost full speed.
Assembling sequences without quality values is like ... like ... like driving a car down a winding mountain road at 200 km/h without brakes, airbags or a steering wheel. With a ravine on one side and a rock face on the other. Did I mention the missing seat-belts? You might get down safely, but experience tells the result will rather be a bloody mess.
All MIRA routines internally are geared toward quality values guiding decisions. No one should ever assemble anything without quality values. Never. Ever. Even if quality values are sometimes inaccurate, they do help.
Now, there are very rare occasions where getting quality values is not possible. If you absolutely cannot get them, and I mean only in this case, use these switches: --noqualities[=SEQUENCINGTECHNOLOGY] SEQUENCINGTECHNOLOGY_SETTINGS -AS:bdq=30. E.g.:
--noqualities=454 454_SETTINGS -AS:bdq=30
or
--noqualities SANGER_SETTINGS -AS:bdq=30 454_SETTINGS -AS:bdq=30
This tells MIRA not to complain about missing quality values and to fake a quality value of 30 for all reads having no qualities, allowing some MIRA routines (in standard parameter settings) to start disentangling your repeats.
Warning: Doing the above has some severe side-effects. You will be, e.g., at the mercy of non-random sequencing errors. I suggest combining the above with a [-CO:mrpg=4] or higher. You also may want to tune the [-AS:bdq] parameter together with [-CO:mnq] and [-CO:mgqrt] in cases where you mix sequences with and without quality values.
Viewing the results of a mira assembly or preprocessing the sequences for an assembly can be done with a number of different programs. The following ones are just examples, there are a lot more packages available:
If you really have nothing else as a viewer, a browser that understands tables is needed to view the HTML output. A browser knowing style sheets (CSS) is recommended, as different tags will be highlighted. Konqueror, Opera, Mozilla, Netscape and Internet Explorer all do fine, lynx is not really ... optimal.
You'll want GAP4 (generally speaking: the Staden package) to preprocess the sequences, visualise and eventually rework the results when using gap4da output. The Staden package comes with a fully featured sequence preparing and annotating engine (pregap4) that is very useful to preprocess your data (conversion between file types, quality clipping, tagging etc.).
See http://www.sourceforge.net/projects/staden/ for further information and also a possibility to download precompiled binaries for different platforms.
Reading result files from ssaha2 or smalt from the Sanger centre is supported directly by mira to perform a fast and efficient tagging of sequencing vector stretches. This makes you basically independent from any other commercial or license-requiring vector screening software. For Sanger reads, a combination of lucy (see below), ssaha2 or smalt together with the mira parameters for SSAHA2 / SMALT support ([-CL:msvs]) and quality clipping ([-CL:qc]) should do the trick. For reads coming from 454 pyro-sequencing, ssaha2 or smalt and the SSAHA2 / SMALT support also work pretty well.
See http://www.sanger.ac.uk/resources/software/ssaha2/ and / or http://www.sanger.ac.uk/resources/software/smalt/ for further information and also a possibility to download the source or precompiled binaries for different platforms.
lucy from TIGR (now JCVI) is another useful sequence preprocessing program. Lucy is a utility that prepares raw DNA sequence fragments for sequence assembly. The cleanup process includes quality assessment, confidence reassurance, vector trimming and vector removal.
There's a small script in the MIRA 3rd party package which converts the clipping data from the lucy format into something mira can understand (NCBI Traceinfo).
See ftp://ftp.tigr.org/pub/software/Lucy/ to download the source code of lucy.
Viewing .ace file output without consed can be done with clview from TIGR. See http://www.tigr.org/tdb/tgi/software/.
Tablet http://bioinf.scri.ac.uk/tablet/ may also be used for this.
The Integrated Genome Browser (IGB) of the GenoViz project at SourceForge (http://sourceforge.net/projects/genoviz/) is just perfect for loading a genome and looking at mapping coverage (provided by the wiggle result files of MIRA).
TraceTuner (http://sourceforge.net/projects/tracetuner/) is a tool for base and quality calling of trace files from DNA sequencing instruments. Originally developed by Paracel, this code base was released as open source in 2006 by Celera.
phred (basecaller) - cross_match (sequence comparison and filtering) - phrap (assembler) - consed (assembly viewer and editor). This is another package that can be used for this type of job, but requires more programming work. The fact that sequence stretches are masked out (overwritten with the character X) if they shouldn't be used in an assembly doesn't really help and is considered harmful (but it works).
Note the bug of consed when reading ACE files, see more about this in the section on file types (above) in the entry for ACE.
See http://www.phrap.org/ for further information.
A text viewer for the different textual output files.
As always, most of the time a combination of several different packages is possible. My currently preferred combo for genome projects is ssaha2 or smalt and/or lucy (vector screening), MIRA (assembly, of course) and gap4 (assembly viewing and finishing).
For re-assembling projects that were edited in gap4, one will also need the gap2caf converter. The source for this is available at http://www.sanger.ac.uk/resources/software/caf.html.
Since the V2.9.24x3 version of mira, miramem is available as a program call. When called from the command line, it will ask a number of questions and then print out an estimate of the amount of RAM needed to assemble the project. Take this estimate with a grain of salt: depending on the properties of the sequences, variations in the estimate can be +/- 30% for bacteria and 'simple' eukaryotes. The higher the number of repeats is, the more likely you will need to restrict memory usage in some way or another.
Here's the transcript of a session with miramem:
This is MIRA V3.2.0rc1 (development version).

Please cite: Chevreux, B., Wetter, T. and Suhai, S. (1999), Genome Sequence
Assembly Using Trace Signals and Additional Sequence Information.
Computer Science and Biology: Proceedings of the German Conference on
Bioinformatics (GCB) 99, pp. 45-56.

To (un-)subscribe the MIRA mailing lists, see:
        http://www.chevreux.org/mira_mailinglists.html

After subscribing, mail general questions to the MIRA talk mailing list:
        mira_talk@freelists.org

To report bugs or ask for features, please use the new ticketing system at:
        http://sourceforge.net/apps/trac/mira-assembler/
This ensures that requests don't get lost.

[...]

miraMEM helps you to estimate the memory needed to assemble a project.
Please answer the questions below.

Defaults are give in square brackets and chosen if you just press return.
Hint: you can add k/m/g modifiers to your numbers to say kilo, mega or giga.

Is it a genome or transcript (EST/tag/etc.) project? (g/e/) [g]
g
Size of genome? [4.5m] 9.8m
9800000
Size of largest chromosome? [9800000]
9800000
Is it a denovo or mapping assembly? (d/m/) [d]
d
Number of Sanger reads? [0]
0
Are there 454 reads? (y/n/) [n] y
y
Number of 454 GS20 reads? [0]
0
Number of 454 FLX reads? [0]
0
Number of 454 Titanium reads? [0] 750k
750000
Are there PacBio reads? (y/n/) [n]
n
Are there Solexa reads? (y/n/) [n]
n

************************* Estimates *************************

The contigs will have an average coverage of ~ 30.6 (+/- 10%)

RAM estimates:
           reads+contigs (unavoidable): 7.0 GiB
                large tables (tunable): 688. MiB
                                        ---------
                          total (peak): 7.7 GiB

            add if using -CL:pvlc=yes : 2.6 GiB

Estimates may be way off for pathological cases.

Note that some algorithms might try to grab more memory if
the need arises and the system has enough RAM. The options
for automatic memory management control this:
  -AS:amm, -AS:kpmf, -AS:mps
Further switches that might reduce RAM (at cost of run time
or accuracy):
  -SK:mhim, -SK:mchr (both runtime); -SK:mhpr (accuracy)
*************************************************************
If your RAM is not large enough, you can still assemble projects by using disk swap. Up to 20% of the needed memory can be provided by swap without the speed penalty getting too large. Going above 20% is not recommended though, and above 30% the machine will be almost permanently swapping at some point or another.
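The numbers in the miramem transcript and the swap rule of thumb can be checked with a few lines of Python. This is a sketch under stated assumptions: the average read length of 400 bases is not given in the transcript, it is back-calculated here because it reproduces the reported coverage of ~30.6, and all function names are made up for the illustration. miramem itself may compute its estimates differently.

```python
def average_coverage(num_reads, avg_read_len, genome_size):
    """Expected average coverage: total sequenced bases / genome size."""
    return num_reads * avg_read_len / genome_size


def swap_fraction(needed_gib, ram_gib):
    """Fraction of the needed memory that would have to come from swap."""
    return max(0.0, needed_gib - ram_gib) / needed_gib


def swap_verdict(fraction):
    """Apply the 20% / 30% rule of thumb given in the text above."""
    if fraction <= 0.20:
        return "acceptable"
    if fraction <= 0.30:
        return "not recommended"
    return "almost permanent swapping"


if __name__ == "__main__":
    # 750k Titanium reads on a 9.8 Mb genome; 400 bases per read is assumed.
    print(f"coverage ~ {average_coverage(750_000, 400, 9_800_000):.1f}")
    # Peak estimate from the transcript: 7.0 GiB + 688 MiB.
    peak_gib = 7.0 + 688 / 1024
    for ram_gib in (8.0, 6.5, 5.7, 4.0):
        fraction = swap_fraction(peak_gib, ram_gib)
        print(f"{ram_gib:4.1f} GiB RAM: {fraction:5.1%} swap, {swap_verdict(fraction)}")
```

Keeping the thresholds in one small function makes it easy to adjust them if your machine has particularly fast or slow swap storage.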
NEW since 2.7.4: The new SKIM3 algorithm (initial all-against-all read comparison) is now approximately 60 times faster than the SKIM algorithms of earlier versions. E.g. SKIMming of 53,000 Sanger type shotgun reads now takes a bit more than a minute instead of 62 minutes.
The times given below are only approximate and were gathered on my home development box (Athlon 4800+) using a single core and minimal debug code compiled in, somewhat slowing down the whole process.
Example 1: a small genomic project with 720 reads forming 35k bases of contig sequences. Using --job=denovo,genome,accurate,sanger and resolving minor repeat misassemblies, full read extension and automatic contig editing takes 19 seconds.
Example 2: a bacterial genome project with two very closely related strains, 53000 Sanger reads forming a bit more than 3 megabases of contig sequences for each strain. Using the --job=denovo,genome,accurate,sanger (four main passes, read extension, clipping of vector remnants), resolving repeat misassemblies (mostly RNA stretches, but also some very closely related genes) takes 1hr and 48 minutes and uses a maximum of 1.2GB of RAM (miramem estimated the usage to be 1.5GB).
Example 3: Here are the times for miraSearchESTSNPs in a non-normalised (thus very repetitive) EST project, 9747 reads with an average length of 674 used bases:
The fast filtering algorithm performs about 12 million sequence comparisons per second (8 seconds).
Banded Smith-Waterman performs around 750 sequence alignments per second (with a 15% band to each side, which is quite generous), 4:07 for about 182000 alignment checks.
The three steps of miraSearchESTSNPs (each one again subdivided in a number of MIRA passes), including resolving very high coverage contigs (>500 sequences) in multiple passes and splitting them into different SNP and splice variants takes about 20 minutes.
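The throughput figures of Example 3 can be cross-checked with simple arithmetic. The sketch below assumes that the fast filter compares every read against every read (9747 x 9747 comparisons). That is an interpretation which happens to match the quoted 8 seconds, not a statement about how the filter counts its comparisons internally. The function names are illustrative only.

```python
def all_vs_all_seconds(num_reads, comparisons_per_second):
    """Time for an all-against-all comparison, assuming n * n comparisons."""
    return num_reads * num_reads / comparisons_per_second


def mmss(seconds):
    """Format a duration in seconds as minutes:seconds."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}:{secs:02d}"


if __name__ == "__main__":
    skim = all_vs_all_seconds(9747, 12e6)
    print(f"fast filter: {skim:.1f} s (text: 8 seconds)")
    align = 182_000 / 750
    print(f"Smith-Waterman: {mmss(align)} (text: 4:07)")
```

The small gap between the computed alignment time and the quoted 4:07 is consistent with "around 750" being a rounded rate.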
File Input / Output:
mira can only read unedited EXP files.
There sometimes is a (rather important) memory leak occurring while using the assembly integrated Sanger read editor. I have not been able to trace the reason yet.
There's an unexpected bug for MacOS which leads to rubbish assemblies on large data sets which assemble totally fine with Linux versions of MIRA. I have not been able to find out the reason for this yet.
Assembly process:
The routines for determining Repeat Marker Bases (SRMr) are sometimes too sensitive, which leads to excessive base tagging and prevents correct assemblies in subsequent assembly processes. The parameters you should look at for this problem are [-CO:mrc:nrz:mgqrt:mgqwpc]. Also look at [-CL:pvc] and [-CO:emea] if you have a lot of sequencing vector relics at the end of the sequences.
The assignment of reads to debris or singlets, and whether or not they are put into the final result, is messy; the statistics about this are sometimes even wrong. Needs to be redone.
These are some of the topics on my TODO list for the next revisions to come:
Making parts of the process multi-threaded (currently stopped due to other priorities like Solexa etc.)
Less disk usage when using EST assembly on 10 or more million Solexa reads
Other nifty ideas that I have not completely thought out yet.
Note: description is old and needs to be adapted to the current 2.9.x / 3.x line.
To avoid the "garbage-in, garbage-out" problem, mira uses a 'high quality alignments first' contig building strategy. This means that the assembler will start with those regions of sequences that have been marked as good quality (high confidence region - HCR) with low error probabilities (the clipping must have been done by the base caller or other preprocessing programs, e.g. pregap4) and then gradually extends the alignments as errors in different reads are resolved through error hypothesis verification and signal analysis.
This assembly approach relies on some of the automatic editing functionality provided by the EdIt package which has been integrated in parts within mira.
This is an approximate overview on the steps that are executed while assembling:
1. All the experiment / phd / fasta sequences that act as input are loaded (or the CAF project). Qualities for the bases are loaded from the FASTA or SCF if needed.
2. The ends of the reads are cleaned to ensure they have a minimum stretch of bases without sequencing errors.
3. The high confidence region (HCR) of each read is compared with a quick algorithm to the HCR of every other read to see if it could match and have overlapping parts (this is the 'SKIM' filter).
4. All the reads which could match are checked with an adapted Smith-Waterman alignment algorithm (banded version). Obvious mismatches are rejected, the accepted alignments form one or several alignment graphs.
5. Optional pre-assembly read extension step: mira tries to extend the HCR of reads by analysing the read pairs from the previous alignment. This is a bit shaky as reads in this step have not been edited yet, but it can help. Go back to step 2.
6. A contig gets made by building a preliminary partial path through the alignment graph (through in-depth analysis up to a given level) and then adding the most probable overlap candidates to a given contig. Contigs may reject reads if these introduce too many errors in the existing consensus. Errors in regions known as dangerous (for the time being only ALUS and REPT) get additional attention by performing simple signal analysis when alignment discrepancies occur.
7. Optional: the contig can be analysed and corrected by the automatic editor ("EdIt" for Sanger reads, or the new MIRA editor for 454 reads).
8. Long repeats are searched for. Bases in reads of different repeats that have been assembled together but differ sufficiently (sufficiently for EdIt not to have edited them, and as judged by phred quality value) get tagged with special tags (SRMr and WRMr).
9. Go back to step 5 if there are reads present that have not been assembled into contigs.
10. Optional: Detection of spoiler reads that prevent joining of contigs. Remedy by shortening them.
11. Optional: Write out a checkpoint assembly file and go back to step 2.
12. The resulting project is written out to different output files and directories.
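The idea behind the early steps above, a cheap filter that proposes read pairs followed by a more expensive alignment that confirms or rejects them, can be illustrated with a generic shared k-mer filter. The Python sketch below is not MIRA's SKIM algorithm: the function names, the k-mer size and the threshold are assumptions chosen for a small demonstration, and reverse complements are ignored for brevity.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_pairs(reads, k=8, min_shared=2):
    """Propose read pairs that share at least min_shared k-mers.

    reads maps a read name to its sequence. Only pairs returned here
    would be handed to the (much slower) alignment step.
    """
    # Index: which reads contain each k-mer.
    index = defaultdict(set)
    for name, seq in reads.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    # Count shared k-mers for every pair of reads that co-occur in the index.
    shared = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1
    return sorted(pair for pair, count in shared.items() if count >= min_shared)


if __name__ == "__main__":
    reads = {
        "r1": "ACGTACGTTGCAAGCT",
        "r2": "TTGCAAGCTGGATCCA",  # overlaps the end of r1
        "r3": "GGGGGGGGCCCCCCCC",  # unrelated
    }
    print(candidate_pairs(reads))
```

The design point is that the filter may propose false candidates, since a later alignment weeds them out, while unrelated reads never reach the expensive step at all.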