“You have to know what you're looking for before you can find it.”
--Solomon Short
MIRA makes results available in a number of formats: CAF, ACE, FASTA and a few others. The preferred formats are CAF and MAF, as these formats can be translated into any other supported format.
For the assembly, MIRA creates a directory named projectname_assembly in which a number of sub-directories will have appeared.
Note: The projectname is determined by the mira parameter --project=... or, if used, the specific --proout=... parameter.
These sub-directories (and files within) contain the results of the assembly itself, general information and statistics on the results and -- if not deleted automatically by MIRA -- a tmp directory with log files and temporary data:
projectname_d_results: this directory contains all the output files of the assembly in different formats.
projectname_d_info: this directory contains information files of the final assembly. They provide statistics as well as, e.g., information (easily parseable by scripts) on which read is found in which contig etc.
projectname_d_tmp: this directory contains log files and temporary assembly files. It can be safely removed after an assembly, as it can easily hold a few GB of data that is normally not needed anymore. The default settings of MIRA are such that really big files are automatically deleted when they are no longer needed during an assembly.
projectname_d_chkpt: this directory contains checkpoint files needed to resume assemblies that crashed or were stopped (not implemented yet, but soon).
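The directory naming scheme described above lends itself to a small helper for scripts that post-process an assembly. The following is a minimal sketch in Python; the function names `assembly_layout` and `info_assembly_file` are hypothetical, and the only things taken from this manual are the `_assembly`, `_d_results`, `_d_info`, `_d_tmp` and `_d_chkpt` suffixes and the `_info_assembly.txt` file name.

```python
from pathlib import Path

# Sub-directory suffixes as listed in this section of the manual.
SUBDIR_SUFFIXES = {
    "results": "_d_results",
    "info": "_d_info",
    "tmp": "_d_tmp",
    "chkpt": "_d_chkpt",
}


def assembly_layout(projectname, base="."):
    """Return the expected MIRA output directories for a project name.

    Nothing is created or checked on disk; this only builds the paths.
    """
    top = Path(base) / f"{projectname}_assembly"
    return {key: top / f"{projectname}{suffix}"
            for key, suffix in SUBDIR_SUFFIXES.items()}


def info_assembly_file(projectname, base="."):
    """Path of the assembly summary file inside the info directory."""
    info_dir = assembly_layout(projectname, base)["info"]
    return info_dir / f"{projectname}_info_assembly.txt"
```

A script could call `info_assembly_file("myproject")` and test `.exists()` on the result before trying to read the summary, since some directories (for example the tmp directory) may have been removed.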
The following files in projectname_d_results contain results of the assembly in different formats. Depending on the output options of MIRA, some files may or may not be there. As long as the CAF or MAF file is present, you can translate your assembly later on to just about any supported format with the convert_project program supplied with the MIRA distribution:
projectname_out.txt: this file contains, in a human-readable format, the aligned assembly results, where all input sequences are shown in the context of the contig they were assembled into. This file is just meant as a quick way for people to have a look at their assembly without specialised alignment finishing tools.
projectname_out.padded.fasta: this file contains as FASTA sequence the consensus of the contigs that were assembled in the process. Positions in the consensus containing gaps (also called 'pads', denoted by an asterisk) are still present. The computed consensus qualities are in the corresponding projectname_out.padded.fasta.qual file.
projectname_out.unpadded.fasta: as above, this file contains as FASTA sequence the consensus of the contigs that were assembled in the process, but positions in the consensus containing gaps were removed. The computed consensus qualities are in the corresponding projectname_out.unpadded.fasta.qual file.
projectname_out.caf: this is the result of the assembly in CAF format, which can be further worked on with, e.g., tools from the caftools package from the Sanger Centre and later on be imported into, e.g., the Staden gap4 assembly and finishing tool.
projectname_out.ace: this is the result of the assembly in ACE format. This format can be read by viewers like the TIGR clview or by consed from the phred/phrap/consed package.
projectname_out.gap4da: this directory contains the result of the assembly suited for the direct assembly import of the Staden gap4 assembly viewer and finishing tool.
The following files in projectname_d_info contain statistics and other information files of the assembly:
projectname_info_assembly.txt: This file should be your first stop after an assembly. It will tell you some statistics as well as whether or not problematic areas remain in the result.
projectname_info_callparameters.txt: This file contains the parameters as given on the mira command line when the assembly was started.
projectname_info_consensustaglist.txt: This file contains information about the tags (and their position) that are present in the consensus of a contig.
projectname_info_contigreadlist.txt: This file contains information on which reads have been assembled into which contigs (or singlets).
projectname_info_contigstats.txt: This file contains, in tabular format, statistics about the contigs themselves: their length, average consensus quality, number of reads, maximum and average coverage, average read length, and number of A, C, G, T, N, X and gaps in the consensus.
projectname_info_debrislist.txt: This file contains the names of all the reads which were not assembled into contigs (or singlets if appropriate MIRA parameters were chosen).
projectname_info_readrepeats: This file helps to find out which parts of which reads are quite repetitive in a project. Please consult the chapter on how to tackle "hard" sequencing projects to learn how this file can help you in spotting sequencing mistakes and / or difficult parts in a genome or EST / RNASeq project.
projectname_info_readstooshort: A list containing the names of those reads that have been sorted out of the assembly only due to the fact that they were too short, before any processing started.
projectname_info_readtaglist.txt: This file contains information about the tags and their position that are present in each read. The read positions are given relative to the forward direction of the sequence (i.e. as it was entered into the assembly).
projectname_error_reads_invalid: A list of sequences that have been found to be invalid due to various reasons (given in the output of the assembler).
Once finished, have a look at the file *_info_assembly.txt in the info directory. The assembly information given there is split into three major parts:
some general assembly information (number of reads assembled etc.). This part is quite short at the moment and will be expanded in the future.
assembly metrics for 'large' contigs.
assembly metrics for all contigs.
The part on large contigs contains several sections. The first of these shows what MIRA counts as a large contig for this particular project. As an example, it may look like this:
    Large contigs:
    --------------
    With    Contig size             >= 500
            AND (Total avg. Cov     >= 19
                 OR Cov(san)        >= 0
                 OR Cov(454)        >= 8
                 OR Cov(pbs)        >= 0
                 OR Cov(sxa)        >= 11
                 OR Cov(sid)        >= 0
                )
The above is for a 454 and Solexa hybrid assembly in which MIRA determined large contigs to be contigs
of length of at least 500 bp and
having a total average coverage of at least 19x, or an average 454 coverage of at least 8x, or an average Solexa coverage of at least 11x.
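The rule above can be written down as a small predicate, which may help when filtering a contig statistics table with a script. This is only a sketch of the rule as it is described in the text, not MIRA's own code. The function name and argument names are hypothetical, and it assumes that technologies shown with a threshold of 0 are simply unused in the project and can be left out.

```python
def is_large_contig(length, total_cov, tech_cov,
                    min_length=500, min_total_cov=19, min_tech_cov=None):
    """Apply the 'large contig' rule from the example above.

    length        -- contig length in bases
    total_cov     -- average coverage over all technologies
    tech_cov      -- dict of average coverage per technology, e.g. {"454": 9.5}
    min_tech_cov  -- per-technology thresholds; defaults to the example values
                     for 454 and Solexa (assumption: technologies listed with
                     a threshold of 0 are unused and therefore ignored here)
    """
    if min_tech_cov is None:
        min_tech_cov = {"454": 8, "sxa": 11}
    if length < min_length:
        return False
    if total_cov >= min_total_cov:
        return True
    return any(tech_cov.get(tech, 0) >= threshold
               for tech, threshold in min_tech_cov.items())
```

The thresholds are parameters because MIRA determines them per project; the values from your own *_info_assembly.txt should replace the defaults.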
The second section is about length assessment of large contigs:
    Length assessment:
    ------------------
    Number of contigs:    44
    Total consensus:      3567224
    Largest contig:       404449
    N50 contig size:      186785
    N90 contig size:      55780
    N95 contig size:      34578
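In the above example, 44 contigs totalling 3.57 megabases were built, the largest contig being 404 kilobases long. The N50, N90 and N95 numbers give the length of the shortest contig in the set of largest contigs that together cover at least 50%, 90% and 95% of the total consensus, respectively.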
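For readers who want to recompute such numbers from a list of contig lengths, here is a minimal Python sketch of the usual N50-style calculation. It uses the common definition (sort contigs from longest to shortest and accumulate lengths until the requested share of the total is reached); it is assumed, not confirmed by this manual, that MIRA computes its values the same way. The function name `nxx` is hypothetical.

```python
def nxx(contig_lengths, percent):
    """Return the Nxx contig size (e.g. percent=50 for N50).

    Contigs are sorted from longest to shortest and their lengths are
    accumulated; the result is the length of the contig at which the
    running sum first reaches `percent` % of the total length.
    Returns 0 for an empty list.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Integer arithmetic avoids floating point rounding at the boundary.
        if running * 100 >= total * percent:
            return length
    return 0


lengths = [100, 80, 60, 40, 20]
n50 = nxx(lengths, 50)
n90 = nxx(lengths, 90)
```

In the toy list above the total is 300 bases, so the N50 is reached with the second-longest contig and the N90 with the fourth.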
The next section shows information about the coverage assessment of large contigs. An example:
    Coverage assessment:
    --------------------
    Max coverage (total): 563
    Max coverage
        Sanger: 0
        454:    271
        PacBio: 0
        Solexa: 360
        Solid:  0
    Avg. total coverage (size >= 5000): 57.38
    Avg. coverage (contig size >= 5000)
        Sanger: 0.00
        454:    25.10
        PacBio: 0.00
        Solexa: 32.88
        Solid:  0.00
The maximum coverage attained was 563, the maximum for 454 alone was 271 and for Solexa alone 360. The average total coverage (computed from contigs with a size ≥ 5000 bases) is 57.38. The average coverage by sequencing technology (in contigs ≥ 5000 bases) is 25.10 for 454 and 32.88 for Solexa reads.
Note: The value for "Avg. total coverage (size >= 5000)" is currently always calculated for contigs having 5000 or more consensus bases. While this gives a very effective measure for genome assemblies, EST assemblies will often have totally irrelevant values here as most genes in eukaryotes (and prokaryotes) tend to be smaller than 5000 bases.
The last section contains some numbers useful for quality assessment. It looks like this:
    Quality assessment:
    -------------------
    Average consensus quality:                  90
    Consensus bases with IUPAC:                 11 (you might want to check these)
    Strong unresolved repeat positions (SRMc):  0 (excellent)
    Weak unresolved repeat positions (WRMc):    19 (you might want to check these)
    Sequencing Type Mismatch Unsolved (STMU):   0 (excellent)
    Contigs having only reads wo qual:          0 (excellent)
    Contigs with reads wo qual values:          0 (excellent)
Besides the average quality of the contigs and whether they contain reads without quality values, MIRA shows the number of different tags in the consensus which might point at problems.
The above-mentioned sections (length assessment, coverage assessment and quality assessment) for large contigs are then repeated for all contigs, this time also including contigs which MIRA did not count as large.
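If you want to pick these numbers up in a script, a simple "name: value" parser is often enough. The sketch below is written in Python and rests on an assumption about the file layout that this manual does not spell out: that each metric sits on its own line in the form `name: number`, optionally followed by a comment in parentheses. The function name `parse_metrics` is hypothetical. Because the same metric names appear once for large contigs and once for all contigs, the function is meant to be fed one section at a time.

```python
def parse_metrics(section_text):
    """Parse 'name: number [comment]' lines of one section into a dict.

    Lines without a colon, headings without a value and lines whose value
    does not start with a number are skipped. Whole numbers are returned
    as int, everything else as float.
    """
    metrics = {}
    for line in section_text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no colon, e.g. an underline of dashes
        tokens = value.split()
        if not tokens:
            continue  # a heading such as "Length assessment:"
        try:
            number = float(tokens[0])
        except ValueError:
            continue  # value is not numeric
        metrics[name.strip()] = int(number) if number.is_integer() else number
    return metrics


sample = """Quality assessment:
-------------------
Average consensus quality:                  90
Consensus bases with IUPAC:                 11 (you might want to check these)
Weak unresolved repeat positions (WRMc):    19 (you might want to check these)
"""
quality = parse_metrics(sample)
```

With the result in a dict, a pipeline could, for example, flag an assembly for manual inspection whenever the SRMc or WRMc counts are above zero.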
The gap4 program from the Staden package is a pretty useful finishing tool and assembly viewer. It has its own database format which MIRA does not read or write, but there are interconversion possibilities using the CAF format and the caf2gap and gap2caf utilities.
Conversion is pretty straightforward. From MIRA to gap4, it's like this:
$ caf2gap -project YOURGAP4PROJECTNAME -ace mira_result.caf >&/dev/null
Note: Don't be fooled by the -ace parameter of caf2gap. It needs a CAF file as input, not an ACE file.
From gap4 to CAF, it's like this:
$ gap2caf -project YOURGAP4PROJECTNAME > tmp.caf
$ convert_project -f caf -t caf -r c tmp.caf somenewname
Note: Using gap2caf, be careful to use the simple > redirection to file and not the >& redirection.
Note: Using first gap2caf and then convert_project is needed as gap4 writes its own consensus to the CAF file, which is not necessarily the best. Indeed, gap4 does not know about different sequencing technologies like 454 and treats everything as Sanger. Therefore, using convert_project with the [-r c] option recalculates a MIRA consensus during the "conversion" from CAF to CAF.
The gap5 program is the successor to gap4. It comes with its own import utility (tg_index) which can read CAF files, and gap5 itself can export to CAF.
Conversion is pretty straightforward. From MIRA to gap5, it's like this:
$ tg_index -C YOURGAP4PROJECTNAME_out.caf
This creates a gap5 database named YOURGAP4PROJECTNAME_out.g5d which can be directly loaded with gap5 like this:
$ gap5 YOURGAP4PROJECTNAME_out.g5d
convert_project is a tool in the MIRA package which reads and writes a number of formats, ranging from full assembly formats like CAF and MAF to simple output view formats like HTML or plain text.
Please read the chapter on MIRA utilities in this manual to learn more about convert_project, and also have a look at convert_project -h, which lists all possible formats and other command line options.
It is important to remember that some assembly options of mira improve the overall assembly while increasing the amount of contig debris, i.e. small contigs with low coverage that can probably be discarded. One infamous option is uniform read distribution ([-AS:urd]), which helps to reconstruct identical repeats across multiple locations in the genome but, as a side effect, leaves some redundant reads that end up as typical contig debris. You probably do not want to have a look at contig debris when finishing a genome unless you are really, really, really picky.
By default, the result files of MIRA contain everything which might play a role in automatic assembly post-processing pipelines, such as those most sequencing centers have implemented.
Many people prefer to just go on with what would be large contigs. Therefore the convert_project program from the MIRA package can selectively filter CAF or MAF files for contigs with a certain size, average coverage or number of reads.
The file *_info_assembly.txt in the info directory at the end of an assembly might give you first hints on what could be suitable filter parameters. For example, in assemblies made in a normal (whatever this means) fashion, I routinely only consider contigs that are larger than 500 bases and have at least one third of the average coverage of the N50 contigs.
Here's an example: In the "Large contigs" section, there's a "Coverage assessment" subsection. It looks a bit like this:
    ...
    Coverage assessment:
    --------------------
    Max coverage (total): 43
    Max coverage
        Sanger: 0
        454:    43
        Solexa: 0
        Solid:  0
    Avg. total coverage (size ≥ 5000): 22.30
    Avg. coverage (contig size ≥ 5000)
        Sanger: 0.00
        454:    22.05
        Solexa: 0.00
        Solid:  0.00
    ...
This project was obviously a 454 only project, and the average coverage for it is ~22. This number was estimated by MIRA by taking only contigs of at least 5kb into account, which for sure left out everything which could be categorised as debris. It's a pretty solid number.
Now, depending on how much time you want to invest performing some manual polishing, you should extract contigs which have at least the following fraction of the average coverage:
2/3 if a quick and "good enough" result is what you want and you don't want to do any manual polishing. In this example, that would be around 14 or 15.
1/2 if you want to have a "quick look" and eventually perform some contig joins. In this example the number would be 11.
1/3 if you want to be quite accurate and be sure not to lose any possible repeat. That would be 7 or 8 in this example.
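The arithmetic behind these three cut-offs is simple enough to script. The following Python sketch multiplies the average coverage reported by MIRA by each of the three fractions; the function name `coverage_cutoffs` is hypothetical, and the value 22.30 is the "Avg. total coverage" from the example above.

```python
def coverage_cutoffs(avg_coverage):
    """Return the minimum average coverage for each polishing strategy.

    Keys are the fractions used in the text; values are rounded to two
    decimals so they can be compared with the rough numbers given there.
    """
    fractions = {"2/3": 2 / 3, "1/2": 1 / 2, "1/3": 1 / 3}
    return {label: round(avg_coverage * fraction, 2)
            for label, fraction in fractions.items()}


cutoffs = coverage_cutoffs(22.30)
for label, value in cutoffs.items():
    print(f"{label} of average coverage: {value}")
```

The chosen value is what you would then pass as the minimum average coverage when filtering contigs with convert_project, rounded up or down to taste.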
Example (useful with assemblies of Sanger data): extracting only contigs ≥ 1000 bases and with a minimum average coverage of 4 into FASTA format:
$ convert_project -f caf -t fasta -x 1000 -y 4 sourcefile.caf targetfile.fasta
Example (useful with assemblies of 454 data): extracting only contigs ≥ 500 bases into FASTA format:
$ convert_project -f caf -t fasta -x 500 sourcefile.caf targetfile.fasta
Example (e.g. useful with Sanger/454 hybrid assemblies): extracting only contigs ≥ 500 bases and with an average coverage ≥ 15 reads into CAF format, then converting the reduced CAF into a Staden GAP4 project:
$ convert_project -f caf -t caf -x 500 -y 15 sourcefile.caf tmp.caf
$ caf2gap -project somename -ace tmp.caf
Example (e.g. useful with Sanger/454 hybrid assemblies): extracting only contigs ≥ 1000 bases and with ≥ 10 reads from MAF into CAF format, then converting the reduced CAF into a Staden GAP4 project:
$ convert_project -f maf -t caf -x 1000 -z 10 sourcefile.maf tmp
$ caf2gap -project somename -ace tmp.caf
Start convert_project with the -h option for help on available options.
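When filtering is part of a larger pipeline, it can be convenient to assemble the convert_project command line programmatically. The Python sketch below only builds the argument list and does not run anything. It uses just the options that appear in the examples of this section (-f, -t, -x, -y, -z); the helper name `convert_project_cmd` and its parameter names are hypothetical.

```python
import shlex


def convert_project_cmd(source, target, from_format, to_format,
                        min_length=None, min_coverage=None, min_reads=None):
    """Build a convert_project argument list with optional contig filters.

    min_length   -> -x (minimum contig length)
    min_coverage -> -y (minimum average coverage)
    min_reads    -> -z (minimum number of reads)
    Filters left as None are omitted from the command line.
    """
    cmd = ["convert_project", "-f", from_format, "-t", to_format]
    if min_length is not None:
        cmd += ["-x", str(min_length)]
    if min_coverage is not None:
        cmd += ["-y", str(min_coverage)]
    if min_reads is not None:
        cmd += ["-z", str(min_reads)]
    cmd += [source, target]
    return cmd


cmd = convert_project_cmd("sourcefile.caf", "targetfile.fasta", "caf", "fasta",
                          min_length=1000, min_coverage=4)
print(shlex.join(cmd))  # human-readable form of the command
```

The returned list can be handed to `subprocess.run(cmd, check=True)` on a machine where convert_project is installed; keeping it as a list avoids shell quoting problems with unusual file names.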
MIRA sets a number of different tags in resulting assemblies. They can be set in reads (in which case they mostly end with an r) or in the consensus (then ending with a c).
If you use the Staden gap4 or consed assembly editor to tidy up the assembly, you can directly jump to places of interest that MIRA marked for further analysis by using the search functionality of these programs.
You should search for the following "consensus" tags for finding places of importance (in this order).
IUPc
UNSc
SRMc
WRMc
STMU (only hybrid assemblies)
MCVc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
SROc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
SAOc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
SIOc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
STMS (only hybrid assemblies)
Of lesser importance are the "read" versions of the tags above:
UNSr
SRMr
WRMr
SROr (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
SAOr (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
SIOr (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
In normal assemblies (only one sequencing technology, just one strain), search for the IUPc, UNSc, SRMc and WRMc tags.
In hybrid assemblies, searching for the IUPc, UNSc, SRMc, WRMc, and STMU tags and correcting only those places will allow you to have a qualitatively good assembly in no time at all.
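The search order for consensus tags can be captured in a few lines of Python, for instance to drive a review checklist. This is a sketch only: the tag names and their order come from the list above, while the function name `consensus_tags_to_search` and the two flags are hypothetical.

```python
# Consensus tags in the order recommended above.
CONSENSUS_TAG_ORDER = ["IUPc", "UNSc", "SRMc", "WRMc", "STMU",
                       "MCVc", "SROc", "SAOc", "SIOc", "STMS"]
HYBRID_ONLY = {"STMU", "STMS"}                # only set in hybrid assemblies
STRAIN_ONLY = {"MCVc", "SROc", "SAOc", "SIOc"}  # only set with strain information


def consensus_tags_to_search(hybrid=False, strains=False):
    """Return the consensus tags worth searching for, in priority order."""
    return [tag for tag in CONSENSUS_TAG_ORDER
            if (tag not in HYBRID_ONLY or hybrid)
            and (tag not in STRAIN_ONLY or strains)]


normal = consensus_tags_to_search()
hybrid = consensus_tags_to_search(hybrid=True)
```

For a hybrid assembly the list also ends with STMS, since it is marked as a hybrid-only tag; as noted above, working through IUPc, UNSc, SRMc, WRMc and STMU is usually what matters.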
Columns with SRMr tags (SRM in Reads) in an assembly without an SRMc tag at the same consensus position show where mira was able to resolve a repeat during the different passes of the assembly; you don't need to look at these. SRMc and WRMc tags, however, mean that there may be unresolved trouble ahead, so you should take a look at these.
Especially in mapping assemblies, columns with the MCVc, SROx, SIOx and SAOx tags are extremely helpful in finding places of interest. As they are only set if you gave strain information to MIRA, you should always do that.
For more information on tags set/used by MIRA and what exactly they mean, please look up the corresponding section in the reference chapter.
The read coverage histogram as well as the template display of gap4 will help you to spot other places of potential interest. Please consult the gap4 documentation.
I recommend investing a couple of minutes (in the best case) to a few hours in joining contigs, especially if the uniform read distribution option of MIRA was used (but first filter for large contigs). This way, you will reduce the number of "false repeats" and improve the overall quality of your assembly.
Joining contigs at repetitive sites of a genome is always a difficult decision. There are, however, two rules which can help:
The following screenshot shows a case where one should not join, as the finishing program (in this case gap4) warns that no template (read-pair) spans the join site:
Figure 1. Join at a repetitive site which should not be performed due to missing spanning templates.
The next screenshot shows a case where one should join as the finishing program (in this case gap4) finds templates spanning the join site and all of them are good:
Remember that MIRA takes a very cautious approach in contig building, and sometimes creates two contigs when it could have created one. Three main reasons can be the cause for this:
when using uniform read distribution, some non-repetitive areas may have generated so many more reads that they start to look like repeats (so-called pseudo-repeats). In this case, reads that are above a given coverage are shaved off (see [-AS:urdcm]) and kept in reserve to be used for another copy of that repeat ... which in the case of a non-repetitive region will of course never arrive. So at the end of an assembly, these shaved-off reads will form short, low-coverage contig debris which can more or less be safely ignored and sorted out via the filtering options ([-x -y -z]) of convert_project.
Some 454 library construction protocols -- especially, but not exclusively, for paired-end reads -- create pseudo-repeats quite frequently. In this case, the pseudo-repeats are characterised by several reads starting at exactly the same position but which can have different lengths. Should MIRA have separated these reads into different contigs, these can be -- most of the time -- safely joined. The following figure shows such a case:
For Solexa data, a non-negligible GC bias has been reported in genome assemblies since late 2009. In genomes with moderate to high GC, this bias actually favours regions with lower GC. Examples were observed where regions with an average GC of 10% less than the rest of the genome had between two and four times more reads than the rest of the genome, leading to false "discovery" of duplicated genome regions.
when using unpaired data, the above described possibility of having "too many" reads in a non-repetitive region can also lead to a contig being separated into two contigs in the region of the pseudo-repeat.
a number of reads (sometimes even just one) can contain "high quality garbage", that is, nonsense bases which got - for some reason or another - good quality values. This garbage can be distributed on a long stretch in a single read or concern just a single base position across several reads.
While MIRA has some algorithms to deal with the disrupting effects of reads like these, the algorithms are not always 100% effective and some such reads might slip through the filters.