Working with the results of MIRA

Bastien Chevreux

MIRA Version 3.4.1.1

Document revision $Id$

Table of Contents

1. MIRA output directories and files
1.1. The *_d_results directory
1.2. The *_d_info directory
2. First look: the assembly info
3. Converting results
3.1. Converting to and from gap4
3.2. Converting to and from gap5
3.3. Converting to and from other formats: convert_project
4. Filtering results
5. Finishing and data analysis: finding places of importance in the assembly
5.1. Tags set by MIRA
5.2. Other places of importance
5.3. Joining contigs
5.3.1. Joining contigs at true repetitive sites
5.3.2. Joining contigs at "wrongly discovered" repetitive sites
 

You have to know what you're looking for before you can find it.

 
 --Solomon Short

MIRA makes results available in quite a number of formats: CAF, ACE, FASTA and a few others. The preferred formats are CAF and MAF, as these formats can be translated into any other supported format.

1.  MIRA output directories and files

For the assembly, MIRA creates a directory named projectname_assembly which contains a number of sub-directories.

[Note]Note
The projectname is determined by the mira parameter --project=... or, if used, the specific --proout=... parameter.

These sub-directories (and files within) contain the results of the assembly itself, general information and statistics on the results and -- if not deleted automatically by MIRA -- a tmp directory with log files and temporary data:

  • projectname_d_results: this directory contains all the output files of the assembly in different formats.

  • projectname_d_info: this directory contains information files of the final assembly. They provide statistics as well as, e.g., information (easily parseable by scripts) on which read is found in which contig etc.

  • projectname_d_tmp: this directory contains log files and temporary assembly files. It can be safely removed after an assembly, as it may easily hold a few GB of data that are normally not needed anymore.

    The default settings of MIRA are such that really big files are automatically deleted when they are not needed anymore during an assembly.

  • projectname_d_chkpt: this directory contains checkpoint files needed to resume assemblies that crashed or were stopped (not implemented yet, but planned).
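The directory layout described above can be sketched in a few lines of Python. This is only an illustration of the naming scheme: the function names are made up for this example, and the code assumes nothing beyond the projectname_assembly / projectname_d_* pattern given in the list.

```python
from pathlib import Path

# Suffixes of the sub-directories listed above.
SUBDIR_SUFFIXES = ("results", "info", "tmp", "chkpt")


def mira_subdirs(projectname, base="."):
    """Return the expected sub-directory paths for a project, keyed by suffix."""
    top = Path(base) / f"{projectname}_assembly"
    return {s: top / f"{projectname}_d_{s}" for s in SUBDIR_SUFFIXES}


def existing_subdirs(projectname, base="."):
    """Return only those expected sub-directories which are present on disk."""
    return {s: p for s, p in mira_subdirs(projectname, base).items() if p.is_dir()}
```

Calling existing_subdirs("myproject") after an assembly would, for instance, show whether the tmp directory is still around and could be removed.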

1.1.  The *_d_results directory

The following files in projectname_d_results contain results of the assembly in different formats. Depending on the output options of MIRA, some files may or may not be there. As long as the CAF or MAF file is present, you can later translate your assembly to just about any supported format with the convert_project program supplied with the MIRA distribution:

  • projectname_out.txt: this file contains in a human readable format the aligned assembly results, where all input sequences are shown in the context of the contig they were assembled into. This file is just meant as a quick way for people to have a look at their assembly without specialised alignment finishing tools.

  • projectname_out.padded.fasta: this file contains as FASTA sequence the consensus of the contigs that were assembled in the process. Positions in the consensus containing gaps (also called 'pads', denoted by an asterisk) are still present. The computed consensus qualities are in the corresponding projectname_out.padded.fasta.qual file.

  • projectname_out.unpadded.fasta: as above, this file contains as FASTA sequence the consensus of the contigs that were assembled in the process, but positions in the consensus containing gaps were removed. The computed consensus qualities are in the corresponding projectname_out.unpadded.fasta.qual file.

  • projectname_out.caf: this is the result of the assembly in CAF format, which can be further worked on with, e.g., tools from the caftools package from the Sanger Centre and later on be imported into, e.g., the Staden gap4 assembly and finishing tool.

  • projectname_out.ace: this is the result of the assembly in ACE format. This format can be read by viewers like the TIGR clview or by consed from the phred/phrap/consed package.

  • projectname_out.gap4da: this directory contains the result of the assembly suited for the direct assembly import of the Staden gap4 assembly viewer and finishing tool.
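The relationship between the padded and the unpadded consensus files can be illustrated with a small Python sketch. It is not how MIRA produces these files; it merely shows the idea of dropping pad positions (asterisks) together with their quality values. The function name and the in-memory representation (a string plus a list of integers) are assumptions made for this example.

```python
def unpad(consensus, quals=None, pad="*"):
    """Remove pad characters from a consensus and drop the matching qualities.

    consensus: padded consensus sequence as a string
    quals:     optional list with one quality value per consensus position
    """
    if quals is None:
        return consensus.replace(pad, ""), None
    if len(quals) != len(consensus):
        raise ValueError("need exactly one quality value per consensus position")
    # Keep only the (base, quality) pairs where the base is not a pad.
    kept = [(base, q) for base, q in zip(consensus, quals) if base != pad]
    return "".join(base for base, _ in kept), [q for _, q in kept]
```

For example, unpad("AC*GT", [30, 40, 10, 50, 60]) drops the third position from both the sequence and the quality list.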

1.2.  The *_d_info directory

The following files in projectname_d_info contain statistics and other information files of the assembly:

  • projectname_info_assembly.txt: This file should be your first stop after an assembly. It will tell you some statistics as well as whether or not problematic areas remain in the result.

  • projectname_info_callparameters.txt: This file contains the parameters as given on the mira command line when the assembly was started.

  • projectname_info_consensustaglist.txt: This file contains information about the tags (and their position) that are present in the consensus of a contig.

  • projectname_info_contigreadlist.txt: This file contains information on which reads have been assembled into which contigs (or singlets).

  • projectname_info_contigstats.txt: This file contains in tabular format statistics about the contigs themselves, their length, average consensus quality, number of reads, maximum and average coverage, average read length, number of A, C, G, T, N, X and gaps in consensus.

  • projectname_info_debrislist.txt: This file contains the names of all the reads which were not assembled into contigs (or singlets if appropriate MIRA parameters were chosen).

  • projectname_info_readrepeats: This file helps to find out which parts of which reads are quite repetitive in a project. Please consult the chapter on how to tackle "hard" sequencing projects to learn how this file can help you in spotting sequencing mistakes and / or difficult parts in a genome or EST / RNASeq project.

  • projectname_info_readstooshort: A list containing the names of those reads that have been sorted out of the assembly only due to the fact that they were too short, before any processing started.

  • projectname_info_readtaglist.txt: This file contains information about the tags and their position that are present in each read. The read positions are given relative to the forward direction of the sequence (i.e. as it was entered into the assembly).

  • projectname_error_reads_invalid: A list of sequences that have been found to be invalid due to various reasons (given in the output of the assembler).

2.  First look: the assembly info

Once finished, have a look at the file *_info_assembly.txt in the info directory. The assembly information given there is split in three major parts:

  1. some general assembly information (number of reads assembled etc.). This part is quite short at the moment but will be expanded in the future.

  2. assembly metrics for 'large' contigs.

  3. assembly metrics for all contigs.

The part on large contigs contains several sections. The first of these shows what MIRA counts as a large contig for this particular project. For example, it may look like this:

Large contigs:
--------------
With    Contig size             >= 500
        AND (Total avg. Cov     >= 19
             OR Cov(san)        >= 0
             OR Cov(454)        >= 8
             OR Cov(pbs)        >= 0
             OR Cov(sxa)        >= 11
             OR Cov(sid)        >= 0
            )

The above is for a 454 and Solexa hybrid assembly in which MIRA determined large contigs to be contigs

  1. of length of at least 500 bp and

  2. having a total average coverage of at least 19x or an average 454 coverage of 8 or an average Solexa coverage of 11
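The criterion above can be written down as a small predicate. The following Python sketch only mirrors the two conditions as explained in the text; the function name and the dictionary keys are chosen for this example, and treating thresholds of 0 (technologies not present in the project) as "not applicable" is an assumption based on the explanation above, not a statement about MIRA's internals.

```python
def is_large_contig(length, total_cov, cov_by_tech,
                    min_length=500, min_total_cov=19, min_cov_by_tech=None):
    """Check a contig against the 'large contig' thresholds shown above.

    cov_by_tech maps a technology key (e.g. "454", "sxa") to the average
    coverage of the contig with reads of that technology.
    """
    if min_cov_by_tech is None:
        # Thresholds from the example; technologies with a threshold of 0
        # are left out (assumption: they are not used in this project).
        min_cov_by_tech = {"454": 8, "sxa": 11}
    if length < min_length:
        return False
    if total_cov >= min_total_cov:
        return True
    return any(cov_by_tech.get(tech, 0) >= minimum
               for tech, minimum in min_cov_by_tech.items() if minimum > 0)
```

With these thresholds, a 1200 bp contig with a total coverage of 12 but a 454 coverage of 9.5 counts as large, while a 450 bp contig never does, whatever its coverage.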

The second section is about length assessment of large contigs:

  Length assessment:
  ------------------
  Number of contigs:    44
  Total consensus:      3567224
  Largest contig:       404449
  N50 contig size:      186785
  N90 contig size:      55780
  N95 contig size:      34578

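In the above example, 44 contigs totalling 3.57 megabases were built, the largest contig being 404 kilobases long. The N50, N90 and N95 values give the contig length at which contigs of that size or larger contain at least 50%, 90% and 95% of the total consensus, respectively.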
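For readers unfamiliar with the N50 family of numbers, here is a minimal Python sketch of the usual definition: sort the contigs from longest to shortest and report the length of the contig at which the running sum first reaches the requested percentage of the total. The function name is made up for this example, and the sketch does not claim to reproduce MIRA's own code.

```python
def nxx(lengths, percent):
    """Return the Nxx contig size (e.g. percent=50 for N50) of a list of lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating point surprises at the boundary.
        if running * 100 >= percent * total:
            return length
    return 0  # only reached for an empty list
```

Calling nxx(contig_lengths, 50), nxx(contig_lengths, 90) and nxx(contig_lengths, 95) on the lengths of the large contigs gives the three values of the kind reported in the length assessment.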

The next section shows information about the coverage assessment of large contigs. An example:

  Coverage assessment:
  --------------------
  Max coverage (total): 563
  Max coverage
        Sanger: 0
        454:    271
        PacBio: 0
        Solexa: 360
        Solid:  0
  Avg. total coverage (size >= 5000): 57.38
  Avg. coverage (contig size >= 5000)
        Sanger: 0.00
        454:    25.10
        PacBio: 0.00
        Solexa: 32.88
        Solid:  0.00

Maximum coverage attained was 563, with a maximum of 271 for 454 alone and 360 for Solexa alone. The average total coverage (computed from contigs with a size ≥ 5000 bases) is 57.38. The average coverage by sequencing technology (in contigs ≥ 5000) is 25.10 for 454 and 32.88 for Solexa reads.

[Note]Note
The value for "Avg. total coverage (size >= 5000)" is currently always calculated for contigs having 5000 or more consensus bases. While this gives a very effective measure for genome assemblies, EST assemblies will often have totally irrelevant values here as most genes in eukaryotes (and prokaryotes) tend to be smaller than 5000 bases.

The last section contains some numbers useful for quality assessment. It looks like this:

  Quality assessment:
  -------------------
  Average consensus quality:                    90
  Consensus bases with IUPAC:                   11      (you might want to check these)
  Strong unresolved repeat positions (SRMc):    0       (excellent)
  Weak unresolved repeat positions (WRMc):      19      (you might want to check these)
  Sequencing Type Mismatch Unsolved (STMU):     0       (excellent)
  Contigs having only reads wo qual:            0       (excellent)
  Contigs with reads wo qual values:            0       (excellent)

Besides the average quality of the contigs and whether they contain reads without quality values, MIRA shows the number of different tags in the consensus which might point at problems.

The above mentioned sections (length assessment, coverage assessment and quality assessment) for large contigs will then be repeated for all contigs, this time also including contigs which MIRA did not count as large contigs.

3.  Converting results

3.1.  Converting to and from gap4

The gap4 program from the Staden package is a pretty useful finishing tool and assembly viewer. It has its own database format which MIRA does not read or write, but conversion in both directions is possible using the CAF format and the caf2gap and gap2caf utilities.

Conversion is pretty straightforward. From MIRA to gap4, it's like this:

$ caf2gap -project YOURGAP4PROJECTNAME -ace mira_result.caf >&/dev/null
[Note]Note
Don't be fooled by the -ace parameter of caf2gap. It needs a CAF file as input, not an ACE file.

From gap4 to CAF, it's like this:

$ gap2caf -project YOURGAP4PROJECTNAME >tmp.caf
$ convert_project -f caf -t caf -r c tmp.caf somenewname
[Note]Note
Using gap2caf, be careful to use the simple > redirection to file and not the >& redirection.
[Note]Note
Using first gap2caf and then convert_project is needed as gap4 writes its own consensus to the CAF file, which is not necessarily the best. Indeed, gap4 does not know about different sequencing technologies like 454 and treats everything as Sanger. Therefore, using convert_project with the [-r c] option recalculates a MIRA consensus during the "conversion" from CAF to CAF.

3.2.  Converting to and from gap5

The gap5 program is the successor to gap4. It comes with its own import utility (tg_index) which can read CAF files, and gap5 itself can export to CAF.

Conversion is pretty straightforward. From MIRA to gap5, it's like this:

$ tg_index -C YOURGAP4PROJECTNAME_out.caf

This creates a gap5 database named YOURGAP4PROJECTNAME_out.g5d which can be directly loaded with gap5 like this:

$ gap5 YOURGAP4PROJECTNAME_out.g5d

3.3.  Converting to and from other formats: convert_project

convert_project is a tool in the MIRA package which reads and writes a number of formats, ranging from full assembly formats like CAF and MAF to simple output view formats like HTML or plain text.

Please read the chapter on MIRA utilities in this manual to learn more on convert_project and also have a look at convert_project -h which lists all possible formats and other command line options.

4.  Filtering results

It is important to remember that some assembly options of mira improve the overall assembly while increasing the number of contig debris, i.e. small contigs with low coverage that can probably be discarded. One infamous example is the option to use uniform read distribution ([-AS:urd]), which helps to reconstruct identical repeats across multiple locations in the genome but, as a side effect, lets some redundant reads end up as typical contig debris. You probably do not want to have a look at contig debris when finishing a genome unless you are really, really, really picky.

By default, the result files of MIRA contain everything which might play a role in the automatic assembly post-processing pipelines that most sequencing centers have implemented.

Many people prefer to just go on with what would be large contigs. Therefore the convert_project program from the MIRA package can selectively filter CAF or MAF files for contigs with a certain size, average coverage or number of reads.

The file *_info_assembly.txt in the info directory at the end of an assembly might give you first hints on what could be suitable filter parameters. For example, in assemblies made in a normal (whatever this means) fashion, I routinely consider only contigs which are larger than 500 bases and have at least one third of the average coverage of the N50 contigs.

Here's an example: In the "Large contigs" section, there's a "Coverage assessment" subsection. It looks a bit like this:

...
Coverage assessment:
--------------------
Max coverage (total): 43
Max coverage
        Sanger: 0
        454:    43
        Solexa: 0
        Solid:  0
Avg. total coverage (size >= 5000): 22.30
Avg. coverage (contig size >= 5000)
        Sanger: 0.00
        454:    22.05
        Solexa: 0.00
        Solid:  0.00
...

This project was obviously a 454 only project, and the average coverage for it is ~22. This number was estimated by MIRA by taking only contigs of at least 5kb into account, which for sure left out everything which could be categorised as debris. It's a pretty solid number.

Now, depending on how much time you want to invest performing some manual polishing, you should extract contigs which have at least the following fraction of the average coverage:

  • 2/3 if a quick and "good enough" result is what you want and you don't want to do any manual polishing. In this example, that would be around 14 or 15.

  • 1/2 if you want to have a "quick look" and possibly perform some contig joins. In this example the number would be 11.

  • 1/3 if you want a quite accurate result and want to be sure not to lose any possible repeat. That would be 7 or 8 in this example.
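The arithmetic behind these three cut-offs is simple enough to script. The following Python sketch pulls the "Avg. total coverage" value out of the text of an *_info_assembly.txt file and derives the three coverage thresholds from it. The function names and the regular expression are assumptions made for this example, based only on the excerpt shown above; the layout of the file may differ between MIRA versions.

```python
import re

# Matches e.g. "Avg. total coverage (size >= 5000): 22.30"
AVG_TOTAL_COV = re.compile(
    r"Avg\. total coverage \(size\s*(?:>=|≥)\s*\d+\):\s*([0-9.]+)")


def parse_avg_total_coverage(info_text):
    """Return the first 'Avg. total coverage' value found in the text, or None."""
    match = AVG_TOTAL_COV.search(info_text)
    return float(match.group(1)) if match else None


def coverage_cutoffs(avg_coverage):
    """Return the 2/3, 1/2 and 1/3 coverage cut-offs for an average coverage."""
    fractions = {"2/3": 2 / 3, "1/2": 1 / 2, "1/3": 1 / 3}
    return {label: round(avg_coverage * f, 2) for label, f in fractions.items()}
```

The resulting numbers are only a starting point; round them to whatever integer you are comfortable with before using them as a minimum coverage when filtering.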

Example (useful with assemblies of Sanger data): extracting only contigs ≥ 1000 bases and with a minimum average coverage of 4 into FASTA format:

$ convert_project -f caf -t fasta -x 1000 -y 4 sourcefile.caf targetfile.fasta

Example (useful with assemblies of 454 data): extracting only contigs ≥ 500 bases into FASTA format:

$ convert_project -f caf -t fasta -x 500 sourcefile.caf targetfile.fasta

Example (e.g. useful with Sanger/454 hybrid assemblies): extracting only contigs ≥ 500 bases and with an average coverage ≥ 15 reads into CAF format, then converting the reduced CAF into a Staden GAP4 project:

$ convert_project -f caf -t caf -x 500 -y 15 sourcefile.caf tmp.caf
$ caf2gap -project somename -ace tmp.caf

Example (e.g. useful with Sanger/454 hybrid assemblies): extracting only contigs ≥ 500 bases and with ≥ 10 reads from MAF into CAF format, then converting the reduced CAF into a Staden GAP4 project:

$ convert_project -f maf -t caf -x 500 -z 10 sourcefile.maf tmp
$ caf2gap -project somename -ace tmp.caf

Start convert_project with the -h option for help on available options.
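When filtering is part of a script, the command lines above can be assembled programmatically. The following Python sketch builds the argument list for convert_project using only the options that appear in the examples of this section (-f, -t, -x, -y, -z); the function name and its parameter names are made up for this illustration.

```python
def convert_project_argv(src_fmt, dst_fmt, source, target,
                         min_length=None, min_avg_cov=None, min_reads=None):
    """Build a convert_project command line as a list of arguments."""
    argv = ["convert_project", "-f", src_fmt, "-t", dst_fmt]
    # Each filter option is only added when a value was given.
    for flag, value in (("-x", min_length), ("-y", min_avg_cov), ("-z", min_reads)):
        if value is not None:
            argv += [flag, str(value)]
    return argv + [source, target]
```

The returned list can be handed to subprocess.run(argv, check=True) from the Python standard library, provided convert_project is installed and found in your PATH.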

5.  Finishing and data analysis: finding places of importance in the assembly

5.1.  Tags set by MIRA

MIRA sets a number of different tags in resulting assemblies. They can be set in reads (in which case they mostly end with an r) or in the consensus (then ending with a c).

If you use the Staden gap4 or consed assembly editor to tidy up the assembly, you can directly jump to places of interest that MIRA marked for further analysis by using the search functionality of these programs.

You should search for the following "consensus" tags for finding places of importance (in this order).

  • IUPc

  • UNSc

  • SRMc

  • WRMc

  • STMU (only hybrid assemblies)

  • MCVc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)

  • SROc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)

  • SAOc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)

  • SIOc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)

  • STMS (only hybrid assemblies)

Of lesser importance are the "read" versions of the tags above:

  • UNSr

  • SRMr

  • WRMr

  • SROr (only when assembling different strains, i.e., mostly relevant for mapping assemblies)

  • SAOr (only when assembling different strains, i.e., mostly relevant for mapping assemblies)

  • SIOr (only when assembling different strains, i.e., mostly relevant for mapping assemblies)

In normal assemblies (only one sequencing technology, just one strain), search for the IUPc, UNSc, SRMc and WRMc tags.

In hybrid assemblies, searching for the IUPc, UNSc, SRMc, WRMc, and STMU tags and correcting only those places will allow you to have a qualitatively good assembly in no time at all.

Columns with SRMr tags (SRM in Reads) in an assembly without an SRMc tag at the same consensus position show where MIRA was able to resolve a repeat during the different passes of the assembly; you don't need to look at these. SRMc and WRMc tags, however, mean that there may be unresolved trouble ahead, so you should take a look at these.

Especially in mapping assemblies, columns with the MCVc, SROx, SIOx and SAOx tags are extremely helpful in finding places of interest. As they are only set if you gave strain information to MIRA, you should always do that.

For more information on tags set/used by MIRA and what they exactly mean, please look up the corresponding section in the reference chapter.
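If you have extracted the tags into a list yourself (for example from the consensus and read tag lists in the info directory), the search order recommended in this section can be applied with a few lines of Python. The representation of a tag as a (tag, contig, position) tuple and the function name are assumptions made for this sketch; the order of the tags is the one given in the lists above.

```python
# Search order as recommended above: consensus tags first, then read tags.
CONSENSUS_TAG_ORDER = ["IUPc", "UNSc", "SRMc", "WRMc", "STMU",
                       "MCVc", "SROc", "SAOc", "SIOc", "STMS"]
READ_TAG_ORDER = ["UNSr", "SRMr", "WRMr", "SROr", "SAOr", "SIOr"]
TAG_RANK = {tag: rank
            for rank, tag in enumerate(CONSENSUS_TAG_ORDER + READ_TAG_ORDER)}


def sort_tags_by_importance(tags):
    """Sort (tag, contig, position) tuples by the recommended search order.

    Tags which are not in the lists above are dropped.
    """
    known = [t for t in tags if t[0] in TAG_RANK]
    return sorted(known, key=lambda t: (TAG_RANK[t[0]], t[1], t[2]))
```

Working through the sorted list from top to bottom then visits the IUPc positions first and the read tags last.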

5.2.  Other places of importance

The read coverage histogram as well as the template display of gap4 will help you to spot other places of potential interest. Please consult the gap4 documentation.

5.3.  Joining contigs

I recommend investing anything from a couple of minutes (in the best case) to a few hours in joining contigs, especially if the uniform read distribution option of MIRA was used (but first filter for large contigs). This way, you will reduce the number of "false repeats" and improve the overall quality of your assembly.

5.3.1.  Joining contigs at true repetitive sites

Joining contigs at repetitive sites of a genome is always a difficult decision. There are, however, two rules which can help:

  1. If the sequencing was done without a paired-end library, don't join.
  2. If the sequencing was done with a paired-end library, but no pair (or template) spans the join site, don't join.

The following screenshot shows a case where one should not join as the finishing program (in this case gap4) warns that no template (read-pair) spans the join site:
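The two rules, together with the requirement that the templates spanning the join site look good, can be condensed into a tiny decision function. This Python sketch is only a memory aid: the function and parameter names are made up, and the counts of spanning and problematic templates have to come from your finishing program.

```python
def should_join(paired_end_library, spanning_templates, bad_spanning_templates=0):
    """Decide whether a join at a repetitive site looks justified.

    paired_end_library:     True if the project contains paired-end data
    spanning_templates:     number of templates (read pairs) spanning the join site
    bad_spanning_templates: how many of those the finishing program flags as bad
    """
    if not paired_end_library:
        return False  # rule 1: no paired-end library, don't join
    if spanning_templates == 0:
        return False  # rule 2: nothing spans the join site, don't join
    return bad_spanning_templates == 0  # join only if all spanning templates are good
```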

Figure 1.  Join at a repetitive site which should not be performed due to missing spanning templates.


The next screenshot shows a case where one should join as the finishing program (in this case gap4) finds templates spanning the join site and all of them are good:

Figure 2.  Join at a repetitive site which should be performed due to spanning templates being good.


5.3.2.  Joining contigs at "wrongly discovered" repetitive sites

Remember that MIRA takes a very cautious approach in contig building, and sometimes creates two contigs when it could have created one. Three main reasons can be the cause for this:

  1. when using uniform read distribution, some non-repetitive areas may have generated so many more reads that they start to look like repeats (so-called pseudo-repeats). In this case, reads that are above a given coverage are shaved off (see [-AS:urdcm]) and kept in reserve to be used for another copy of that repeat ... which in the case of a non-repetitive region will of course never arrive. So at the end of an assembly, these shaved-off reads will form short, low-coverage contig debris which can more or less safely be ignored and sorted out via the filtering options ([-x -y -z]) of convert_project.

    Some 454 library construction protocols -- especially, but not exclusively, for paired-end reads -- create pseudo-repeats quite frequently. In this case, the pseudo-repeats are characterised by several reads starting at exactly the same position but which can have different lengths. Should MIRA have separated these reads into different contigs, these can be -- most of the time -- safely joined. The following figure shows such a case:

    Figure 3.  Pseudo-repeat in 454 data due to sequencing artifacts

    For Solexa data, a non-negligible GC bias has been reported in genome assemblies since late 2009. In genomes with moderate to high GC, this bias actually favours regions with lower GC. Examples were observed where regions with an average GC of 10% less than the rest of the genome had between two and four times more reads than the rest of the genome, leading to false "discovery" of duplicated genome regions.

  2. when using unpaired data, the above described possibility of having "too many" reads in a non-repetitive region can also lead to a contig being separated into two contigs in the region of the pseudo-repeat.

  3. a number of reads (sometimes even just one) can contain "high quality garbage", that is, nonsense bases which got - for some reason or another - good quality values. This garbage can be distributed on a long stretch in a single read or concern just a single base position across several reads.

    While MIRA has some algorithms to deal with the disrupting effects of reads like these, the algorithms are not always 100% effective and some such reads might slip through the filters.
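The signature of the 454 pseudo-repeats described under the first reason (several reads starting at exactly the same position) is easy to look for once you have the start positions of the reads in a contig. The following Python sketch groups reads by start position and reports positions shared by suspiciously many reads. The input format, the function name and the default threshold of three reads are assumptions made for this illustration, not values used by MIRA.

```python
from collections import defaultdict


def stacked_read_starts(reads, min_reads=3):
    """Find start positions shared by at least min_reads reads.

    reads: iterable of (read_name, start_position) pairs for one contig
    Returns a dict mapping start position to the list of read names there.
    """
    by_start = defaultdict(list)
    for name, start in reads:
        by_start[start].append(name)
    return {start: names
            for start, names in sorted(by_start.items())
            if len(names) >= min_reads}
```

Positions reported by such a check are candidates only; whether the neighbouring contigs can really be joined still needs a look in the finishing program.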