Solexa sequence assembly with MIRA3

Bastien Chevreux

MIRA Version 3.4.1.1

Document revision $Id$

Table of Contents

1. Introduction
2. Caveats when assembling Solexa data
3. Typical highlights and lowlights of Solexa sequencing data
3.1. Highlights
3.1.1. Quality
3.1.2. Improved base calling pipeline of Illumina
3.2. Lowlights
3.2.1. Long homopolymers
3.2.2. The GGCxG and GGC motifs
3.2.3. Strong GC bias in some Solexa data (2nd half 2009 until advent of TruSeq kit at end of 2010)
4. Mapping assemblies
4.1. Copying and naming the sequence data
4.2. Copying and naming the reference sequence
4.3. Starting a mapping assembly: unpaired data
4.4. Assembling with multiple strains
4.5. Starting a mapping assembly: paired-end data
4.6. Places of interest in a mapping assembly
4.6.1. Where are SNPs?
4.6.2. Where are insertions, deletions or genome re-arrangements?
4.6.3. Other tags of interest
4.6.4. Comprehensive spreadsheet tables (for Excel or OOcalc)
4.6.5. HTML files depicting SNP positions and deletions
4.6.6. WIG files depicting contig coverage
4.7. Walkthrough: mapping of E.coli from Lenski lab against E.coli B REL606
4.7.1. Getting the data
4.7.2. Preparing the data for an assembly
4.7.3. Starting the mapping assembly
4.7.4. Looking at results
4.7.5. Post-processing with gap4 and re-exporting to MIRA
4.7.6. Converting mapping results into HTML and simple spreadsheet tables for biologists
5. De-novo Solexa only assemblies
5.1. Without paired-end
5.2. With paired-end (only one library size)
5.3. With paired-end (several library sizes)
6. De-novo hybrid assemblies (Solexa + ...)
6.1. All reads de-novo
6.1.1. Starting the assembly
6.2. Long reads first, then Solexa
6.2.1. Step 1: assemble the 'long' reads (454 or Sanger or both)
6.2.2. Step 2: filter the results
6.2.3. Step 3: map the Solexa data
7. Post-processing of assemblies
7.1. Post-processing mapping assemblies
7.2. Post-processing de-novo assemblies
8. Known bugs / problems
 

There is no such thing as overkill.

 
 --Solomon Short

Notes of caution:

  1. this guide is still not finished (and may contain old information regarding read lengths in parts), but it should cover most basic use cases.

  2. you need lots of memory ... ~ 1 to 1.5 GiB per million Solexa reads. Using mira for anything more than 50 to 100 million Solexa reads is probably not a good idea.
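
    As a quick back-of-the-envelope check (assuming nothing beyond the 1 to 1.5 GiB per million reads figure above and an uncompressed FASTQ file; the file name below is just a placeholder): FASTQ stores four lines per read, so

    $ echo "$(( $(wc -l < project_in.solexa.fastq) / 4 )) reads"

    prints the number of reads; 20 million reads then translate to roughly 20 to 30 GiB of RAM.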

1.  Introduction

This guide assumes that you have basic working knowledge of Unix systems, know the basic principles of sequencing (and sequence assembly) and what assemblers do.

While there are step-by-step instructions on how to set up your Solexa data and then perform an assembly, this guide expects you to read at some point in time

  • the MIRA reference manual file to look up some command line options as well as general information on what tags MIRA uses in assemblies, files it generates etc.pp

  • the short usage introduction to MIRA3 so that you have a basic knowledge of how to set up Sanger sequencing projects in mira.

2.  Caveats when assembling Solexa data

Even very short Solexa reads (< 50bp) are great for mapping assemblies. I simply love them as you can easily spot differences in mutant organisms ... or boost the quality of a newly sequenced genome to Q60.

Regarding de-novo assemblies ... well, from an assembler's point of view, very short reads are a catastrophe, regardless of the sequencing technology.

  1. Repeats. The problem of repetitive sequences (e.g. rRNA stretches in bacteria) gets worse the shorter the read lengths get.

  2. Amount of data. As mira is at heart an assembler built to resolve the difficult repetitive problems that occur in Sanger and 454 reads, it drags along quite a lot of ancillary information which is useless in Solexa assemblies ... but still eats away memory.

Things look better for the now available 'longer' Solexa reads. Starting with a length of 75bp and paired-end data, de-novo assembly of bacteria is not that bad at all. The first Solexas with a length of ~110 bases are appearing in public, and from a contig-building perspective these are about as good for de-novo as the first 454 GS20 reads were.

Here's the rule of thumb I use: the longer, the better. If you have to pay a bit more to get longer reads (e.g. Solexa 100mers instead of 75mers), go get the longer reads. With these, the results you generate are way(!) better than with 36, 50 or even 75mers ... both in mapping and de-novo. Don't try to save a couple of hundred bucks in sequencing, you'll pay dearly afterwards in assembly.

3.  Typical highlights and lowlights of Solexa sequencing data

Note: This section contains things I've seen in the past and simply jotted down. You may have different observations.

3.1.  Highlights

3.1.1.  Quality

For 36mers and the MIRA proposed-end-clipping, even in the old pipeline I get about 90 to 95% reads matching to a reference without a single error. For 72mers, the number is approximately 5% lower, 100mers another 5% less. Still, these are great numbers.

3.1.2.  Improved base calling pipeline of Illumina

The new base calling pipeline (1.4 or 2.4?) rolled out by Illumina in Q1/Q2 2009 typically yields 20-50% more data from the very same images. Furthermore, the base calling is way better than in the old pipeline. For Solexa 76mers, after trimming I get only 1% real junk, and between 85 and 90% of the reads match a reference without a single error. Of the remaining reads, roughly 50% have one error, 25% have two errors, 12.5% have three errors etc.

It is worthwhile to re-analyse your old data if the images are still around.

3.2.  Lowlights

3.2.1.  Long homopolymers

Long homopolymers (stretches of identical bases in reads) can be a slight problem for Solexa. However, it must be noted that this is a problem for all sequencing technologies on the market so far (Sanger, Solexa, 454). Furthermore, the problem is much less pronounced in Solexa than in 454 data: in Solexa, the first problems may appear in stretches of 9 to 10 bases, whereas in 454 a stretch of 3 to 4 bases may already start being problematic in some reads.

3.2.2.  The GGCxG and GGC motifs

The GGCxG or even GGC motif in the 5' to 3' direction of reads is particularly annoying, and it took me quite a while to work around the problems it causes in MIRA.

Simply put: at some places in a genome, base calling after a GGCxG or GGC motif is particularly error prone; the number of reads without errors declines markedly. Repeated GGC motifs worsen the situation. The following screenshots of a mapping assembly illustrate this.

The first example is the GGCxG motif (in the form of a GGCTG) occurring in approximately one third of the reads at the shown position. Note that all but one read with this problem are in the same (plus) direction.

Figure 1.  The Solexa GGCxG problem.


The next two screenshots show the GGC problem, once with forward direction reads and once with reverse direction reads:

Figure 2.  The Solexa GGC problem, forward example


Figure 3.  The Solexa GGC problem, reverse example


Places in the genome that have GGCGGC.....GCCGCC (a motif, perhaps even repeated, then some bases and then an inverted motif) almost always have a very, very low number of good reads, especially when the motif is GGCxG.

Things get especially difficult when these motifs occur at sites where users may have a genuine interest. The following example is a screenshot from the Lenski data (see walk-through below) where a simple mapping reveals an anomaly which -- in reality -- is an IS insertion (see http://www.nature.com/nature/journal/v461/n7268/fig_tab/nature08480_F1.html) but could also look like a GGCxG motif in forward direction (GGCCG) and at the same time a GGC motif in reverse direction:

Figure 4.  A genuine place of interest almost masked by the GGCxG problem.


3.2.3.  Strong GC bias in some Solexa data (2nd half 2009 until advent of TruSeq kit at end of 2010)

Here I'm recycling a few slides from a couple of talks I gave in 2010.

Things used to be so nice and easy with the early Solexa data I worked with (36 and 44mers) in late 2007 / early 2008. When sample taking was done right (e.g. for bacteria: in stationary phase) and the sequencing lab did a good job, the read coverage of the genome was almost even. I did see a few papers claiming to see non-trivial GC bias back then, but after having analysed the data I worked with I dismissed them as "not relevant for my use cases." Have a look at the following figure showing, as an example, the coverage of a 45% GC bacterium in 2008:

Figure 5.  Example of no GC coverage bias in 2008 Solexa data. Apart from a slight smile shape of the coverage -- indicating the sample taking was not 100% in stationary phase of the bacterial culture -- everything looks pretty nice: the average coverage is at 27x, and when looking for potential genome duplications at twice the coverage (54x), there is nothing apart from a single peak (which turned out to be a problem in an rRNA region).


Things changed starting sometime in Q3 2009, at least that's when I got some data which made me notice a problem. Have a look at the following figure, which shows exactly the same organism as in the figure above (bacterium, 45% GC):

Figure 6.  Example of GC coverage bias starting Q3 2009 in Solexa data. There's no smile shape anymore -- the people in the lab learned to pay attention to sampling in 100% stationary phase -- but something else is extremely disconcerting: the average coverage is at 33x, and when looking for potential genome duplications at twice the coverage (66x), there are several dozen peaks crossing the 66x threshold over several kilobases (in one case over 200 Kb) all over the genome. As if several small genome duplications had happened.


By the way, the figures above are just examples: I saw over a dozen sequencing projects in 2008 without GC bias and several dozen in 2009 / 2010 with GC bias.

When checking the potential genome duplication sites, they all looked "clean", i.e., the typical genome insertion markers were missing. Poking around for possible explanations, I looked at the GC content of those parts of the genome ... and there was the explanation:

Figure 7.  Example of GC coverage bias, direct comparison of 2008 / 2010 data. The bug has 45% average GC; areas with above average read coverage in the 2010 data turn out to have a lower GC content: around 33 to 36%. The effect is also noticeable in the 2008 data, but barely so.


Why the GC bias suddenly became so strong is unknown to me. The people in the lab have used the same DNA extraction protocol for several years, and the sequencing providers claim to always use the Illumina standard protocols.

But obviously something must have changed. Current ideas about possible reasons include

  • changed chemistries from Illumina leading perhaps to bias during DNA amplification
  • changed "standard" protocols
  • other ...

It took Illumina some 18 months to resolve that problem for the broader public: since the data I work on has been generated with the TruSeq kit, this problem has vanished.

However, if you based some conclusions on, or wrote a paper with, Illumina data which might be affected by the GC bias (Q3 2009 to Q4 2010), I suggest you rethink all the conclusions drawn. This is especially the case for transcriptomics experiments, where a difference in expression of 2x to 3x starts to get highly significant!

4.  Mapping assemblies

This part will walk you step by step through getting your data together for a simple mapping assembly.

I'll make up an example using an imaginary bacterium: Bacillus chocorafoliensis (or short: Bchoc).

In this example, we assume you have two strains: a wild type strain of Bchoc_wt and a mutant which you perhaps got from mutagenesis or other means. Let's imagine that this mutant needs more time to eliminate a given amount of chocolate, so we call the mutant Bchoc_se ... SE for slow eater

You wanted to know which mutations might be responsible for the observed behaviour. Assume the genome of Bchoc_wt is available to you as it was published (or you previously sequenced it), so you resequenced Bchoc_se with Solexa to examine mutations.

4.1.  Copying and naming the sequence data

You need to create (or get from your sequencing provider) the sequencing data in either FASTQ or FASTA + FASTA quality format. The following walkthrough uses what most people nowadays get: FASTQ.

Put the FASTQ data into an empty directory and rename the file so that it looks like this:

arcadia:/path/to/myProject$ ls -l
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocse_in.solexa.fastq

4.2.  Copying and naming the reference sequence

The reference sequence (the backbone) can be in a number of different formats: FASTA, GenBank, CAF. The latter two have the advantage of being able to carry additional information like, e.g., annotation. In this example, we will use a GenBank file like the ones one can download from the NCBI. So, let's assume that our wild type strain is in the following file: NC_someNCBInumber.gbk. Copy this file to the directory (you may also set a link), renaming it to bchocse_backbone_in.gbf.

arcadia:/path/to/myProject$ cp /somewhere/NC_someNCBInumber.gbk bchocse_backbone_in.gbf
arcadia:/path/to/myProject$ ls -l
-rw-r--r-- 1 bach users   6543511 2008-04-08 23:53 bchocse_backbone_in.gbf
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocse_in.solexa.fastq

4.3.  Starting a mapping assembly: unpaired data

Starting the assembly is now just a matter of a simple command line with some parameters set correctly. The following is an example of what I use when mapping onto a reference sequence in GenBank format:

arcadia:/path/to/myProject$ mira
  --project=bchocse --job=mapping,genome,accurate,solexa
  -AS:nop=1
  -SB:bsn=bchoc_wt:bft=gbf:bbq=30
  SOLEXA_SETTINGS
  -SB:ads=yes:dsn=bchocse
  >&log_assembly.txt
[Note]Note 1

The above command has been split across multiple lines for better overview but should be entered as one line.

[Note]Note 2

Please look up the parameters used in the main manual. The ones above basically say: make an accurate mapping of Solexa reads against a genome; in one pass; the name of the backbone strain is 'bchoc_wt'; the file type containing the backbone is a GenBank file; the base qualities for the backbone are to be assumed to be Q30; and, for Solexa data: assign a default strain name to reads which have no ancillary data with strain info loaded, and that default strain name should be 'bchocse'.

[Note]Note 3

For a bacterial project having a backbone of ~4 megabases and ~4.5 million Solexa 36mers, MIRA needs ~21 minutes on my development machine.

A yeast project with a genome of ~20 megabases and ~20 million 36mers needs 3.5 hours and 28 GiB RAM.

For this example - if you followed the walk-through on how to prepare the data - the only options you might want to adapt at first are the following:

  • --project (for naming your assembly project)

  • -SB:bsn to give the backbone strain (your reference strain) another name

  • -SB:bft to load the backbone sequence from another file type, say, a FASTA

  • -SB:dsn to give the Solexa reads another strain name

Of course, you are free to change any option via the extended parameters, but this will be the topic of another FAQ.
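
As an example of such an extended parameter, the walkthrough further down uses [-GE:not=4] to let MIRA use four threads; added to the mapping call above, the command would look like this (everything else unchanged, and again entered as one line):

arcadia:/path/to/myProject$ mira
  --project=bchocse --job=mapping,genome,accurate,solexa
  -AS:nop=1 -GE:not=4
  -SB:bsn=bchoc_wt:bft=gbf:bbq=30
  SOLEXA_SETTINGS
  -SB:ads=yes:dsn=bchocse
  >&log_assembly.txt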

4.4.  Assembling with multiple strains

MIRA will make use of ancillary information when present. The strain name is one such piece of ancillary information. That is, we can tell MIRA the strain of each read we use in the assembly. In the example above, this information was given on the command line as all the reads to be mapped had the same strain information. But what to do if one wants to map reads from several strains?

We could generate a TRACEINFO XML file with all bells and whistles, but for strain data there's an easier way: the straindata file. It's a simple key-value file, one line per entry, with the name of the read as key (first entry in the line) and, separated by a blank, the name of the strain as value (second entry in the line). E.g.:

1_1_207_113 strain1
1_1_61_711  strain1
1_1_182_374 strain2
...
2_1_13_654 strain2
...

Etcetera. You will obviously replace 'strain1' and 'strain2' with your strain names.

This file can be generated quickly and automatically by extracting the read names from the FASTQ file and rewriting them a little bit. Here's how:

arcadia:/path/to/myProject$ ls -l
-rw-r--r-- 1 bach users 494282343 2008-03-28 22:11 bchocse_in.solexa.fastq

arcadia:/path/to/myProject$ grep "^@" bchocse_in.solexa.fastq 
  | sed -e 's/@//' 
  | cut -f 1
  | cut -f 1 -d ' '
  |  sed -e 's/$/ bchocse/'
  > bchocse_straindata_in.txt

arcadia:/path/to/myProject$ ls -l
-rw-r--r-- 1 bach users 494282343 2008-03-28 22:11 bchocse_in.solexa.fastq
-rw-r--r-- 1 bach users 134822451 2008-03-28 22:13 bchocse_straindata_in.txt
[Note]Note 1

The above command has been split across multiple lines for better overview but should be entered as one line.

[Note]Note 2

For larger files, this can take a minute or two to run.

[Note]Note 3

As you can also assemble sequences from more than one strain, the read names in bchocse_straindata_in.txt can have different strain names attached to them. You will then need to generate one straindata file from multiple FASTQ files.

This creates the needed data in the file bchocse_straindata_in.txt (well, it's one way to do it, feel free to use whatever suits you best).
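
If your reads come from more than one strain (see the note above), one possible sketch is to run the extraction once per FASTQ file and append a different strain name each time; the file names and strain names below are made up for illustration, and the awk one-liner simply takes every fourth line (the FASTQ headers), strips the leading '@' and keeps the read name up to the first blank:

arcadia:/path/to/myProject$ awk 'NR % 4 == 1 { print substr($1,2), "strain1" }' strain1_in.solexa.fastq  > bchocse_straindata_in.txt
arcadia:/path/to/myProject$ awk 'NR % 4 == 1 { print substr($1,2), "strain2" }' strain2_in.solexa.fastq >> bchocse_straindata_in.txt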

4.5.  Starting a mapping assembly: paired-end data

When using paired-end data, you must decide whether you want to

  1. use the MIRA feature to create long 'coverage equivalent reads' (CERs), which saves a lot of memory (both in the assembler and later on in an assembly editor). However, you then lose the paired-end information!

  2. or keep the paired-end information, at the expense of larger memory requirements both in MIRA and in assembly editors afterwards.

The Illumina pipeline generally gives you two files for paired-end data: a project-1.fastq and a project-2.fastq. The first file contains the first read of each read pair, the second file the second read. Depending on the preprocessing pipeline of your sequencing provider, the read names can either be exactly the same in both files or already have a /1 or /2 appended.

[Note]Note
For running MIRA, you must concatenate all sequence input files into one file.

If the read names do not follow the /1 and /2 scheme, you must obviously rename them in the process. A little sed command can do this automatically for you. Assuming your reads all have the prefix SRR_something_, the following line appends /1 to all lines which begin with @SRR_something_:

arcadia:/path/to/myProject$ sed -e 's/^@SRR_something_/&\/1/' input.fastq >output.fastq
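
Putting the pieces together, a minimal sketch of the preparation could look like this; project-1.fastq / project-2.fastq are the file names mentioned above, and the sed pattern is the same placeholder as before (it has to cover the complete read name, just like the 'SRR[0-9.]*' pattern used in the Lenski walkthrough script further down):

arcadia:/path/to/myProject$ sed -e 's/^@SRR_something_/&\/1/' project-1.fastq > bchocse_in.solexa.fastq
arcadia:/path/to/myProject$ sed -e 's/^@SRR_something_/&\/2/' project-2.fastq >> bchocse_in.solexa.fastq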

If you don't care about the paired-end information, you can start the mapping assembly exactly like an assembly for data without paired-end info (see section above).

In case you want to keep the paired-end information, here's the command line (again an example when mapping against a GenBank reference file, assuming that the library insert size is ~500 bases):

arcadia:/path/to/myProject$ mira 
  --project=bchocse --job=mapping,genome,accurate,solexa
  -AS:nop=1
  -SB:lsd=yes:bsn=bchoc_wt:bft=gbf:bbq=30
  SOLEXA_SETTINGS 
  -CO:msr=no -GE:uti=no:tismin=250:tismax=750 
  -SB:ads=yes:dsn=bchocse
  >&log_assembly.txt
[Note]Note 1

For this example to work, make sure that the read pairs are named using the Solexa standard, i.e., having '/1' as postfix to the name of one read and '/2' for the other read. If yours have a different naming scheme, look up the -LR:rns parameter in the main documentation.

[Note]Note 2

Please look up the parameters used in the main manual. The ones above basically say: make an accurate mapping of Solexa reads against a genome; in one pass; load additional strain data; the name of the backbone is 'bchoc_wt'; the file type containing the backbone is a GenBank file; the base qualities for the backbone are to be assumed to be Q30. Additionally, only for Solexa reads: do not merge short reads into the contig, switch off template size checking (see below), and set the minimum and maximum expected distance to 250 and 750 respectively.

[Note]Note 3

You will want to use other values than 250 and 750 if your Solexa paired-end library was not with insert sizes of approximately 500 bases.

Comparing this command line with the one for unpaired data, the following parameters were added in the section for Solexa data:

  1. -CO:msr=no tells MIRA not to merge reads that are 100% identical to the backbone. This also makes it possible to keep the template information for the reads.

  2. -GE:uti=no actually switches off checking of template sizes when inserting reads into the backbone. At first glance this might seem counter-intuitive, but it's absolutely necessary to spot, e.g., genome re-arrangements or indels in data analysis after the assembly.

    The reason is that if template size checking were on, the following would happen at, e.g. sites of re-arrangement: MIRA would map the first read of a read-pair without problem. However, it would very probably reject the second read because it would not map at the specified distance from its partner. Therefore, in mapping assemblies with paired-end data, checking of the template size must be switched off.

  3. -GE:tismin:tismax were set to give the minimum and maximum distance paired-end reads may be away from each other. Though this information is not used by MIRA in the assembly itself, it is stored in the result files and can be used afterwards by analysis programs which search for genome re-arrangements.

Note: for other influencing factors you might want to change depending on size of Solexa reads, see section above on mapping of unpaired data.

4.6.  Places of interest in a mapping assembly

This section just gives a short overview of the tags you might find interesting. For more information, especially on how to configure gap4 or consed, please consult the mira usage document and the mira manual.

In file types that allow tags (CAF, MAF, ACE), SNPs and other interesting features will be marked by MIRA with a number of tags. The following sections give a brief overview. For a description of what the tags are (SROc, WRMc etc.), please read up the section "Tags used in the assembly by MIRA and EdIt" in the main manual.

[Note]Note
Screenshots in this section are taken from the walk-through with Lenski data (see below).

4.6.1.  Where are SNPs?

  • the SROc tag will point to most SNPs. Should you assemble sequences of more than one strain (I cannot really recommend such a strategy), you also might encounter SIOc and SAOc tags.

    Figure 8.  "SROc" tag showing a SNP position in a Solexa mapping assembly.

    "SROc" tag showing a SNP position in a Solexa mapping assembly.

    Figure 9.  "SROc" tag showing a SNP/indel position in a Solexa mapping assembly.

    "SROc" tag showing a SNP/indel position in a Solexa mapping assembly.

  • the WRMc tags might sometimes point to SNPs or indels of one or two bases.

4.6.2.  Where are insertions, deletions or genome re-arrangements?

  • Large deletions: the MCVc tags point to deletions in the resequenced data, where no read is covering the reference genome.

    Figure 10.  "MCVc" tag (dark red stretch in figure) showing a genome deletion in Solexa mapping assembly.

    "MCVc" tag (dark red stretch in figure) showing a genome deletion in Solexa mapping assembly.

  • Insertions, small deletions and re-arrangements: these are harder to spot. In unpaired data sets they can be found by looking at clusters of SROc, SRMc, WRMc, and / or UNSc tags.

    Figure 11.  An IS150 insertion hiding behind WRMc and SRMc tags


    More massive occurrences of these tags lead to a rather colourful display in finishing programs, which is why these clusters are also sometimes called Xmas-trees.

    Figure 12.  A 16 base pair deletion leading to a SROc/UNSc xmas-tree


    Figure 13.  An IS186 insertion leading to a SROc/UNSc xmas-tree


    In sets with paired-end data, post-processing software (or alignment viewers) can use the read-pair information to guide you to these sites (MIRA doesn't set tags at the moment).

4.6.3.  Other tags of interest

  • the UNSc tag points to areas where the consensus algorithm had troubles choosing a base. This happens in low coverage areas, at places of insertions (compared to the reference genome) or sometimes also in places where repeats with a few bases difference are present. Often enough, these tags are in areas with problematic sequences for the Solexa sequencing technology like, e.g., a GGCxG or even GGC motif in the reads.

  • the SRMc tag points to places where repeats with a few bases difference are present. Here too, sequences problematic for the Solexa technology are likely to have caused base calling errors and, subsequently, the setting of this tag.

4.6.4.  Comprehensive spreadsheet tables (for Excel or OOcalc)

Biologists are not really interested in SNP coordinates, and why should they be? They're more interested in where SNPs are, how good they are, which genes or other elements they hit, whether they have an effect on a protein sequence, whether they may be important, etc. For organisms without intron/exon structure or splice variants, MIRA can generate pretty comprehensive tables and files if an annotated GenBank file was used as reference and strain information was given to MIRA during the assembly.

Well, MIRA does all that automatically for you if the reference sequence you gave was annotated.

For this, convert_project should be used with the asnp format as target and a CAF file as input:

$ convert_project -f caf -t asnp input.caf output

Note that it is strongly suggested to perform a quick manual cleanup of the assembly prior to this: in rare cases (mainly at sites of small indels of one or two bases), mira will not tag SNPs with a SNP tag (SROc, SAOc or SIOc) but will be fooled into setting a tag denoting unsure positions (UNSc). This can be quickly corrected manually. See further down in this manual in the section on post-processing.

After conversion, you will have four files in the directory which you can all drag-and-drop into spreadsheet applications like OpenOffice Calc or Excel.

The files should be pretty self-explanatory, here's just a short overview:

  1. output_info_snplist.txt is a simple list of the SNPs, with their positions compared to the reference sequence (in bases and map degrees on the genome) as well as the GenBank features they hit.

  2. output_info_featureanalysis.txt is a much extended version of the list above. It puts the SNPs into context of the features (proteins, genes, RNAs etc.) and gives a nice list, SNP by SNP, what might cause bigger changes in proteins.

  3. output_info_featuresummary.txt looks at the changes (SNPs, indels) the other way round. It gives an excellent overview of which features (genes, proteins, RNAs, intergenic regions) you should investigate.

    There's one column (named 'interesting') which pretty much summarises everything you need into three categories: yes, no, and perhaps (see the command-line sketch after this list). 'Yes' is set if indels were detected, an amino acid changed, a start or stop codon changed, or for SNPs in intergenic regions and RNAs. 'Perhaps' is set for SNPs in proteins that change a codon, but not an amino acid (silent SNPs). 'No' is set if no SNP is hitting a feature.

  4. output_info_featuresequences.txt simply gives the sequences of each feature of the reference sequence and the resequenced strain.
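
If you prefer the command line over a spreadsheet for a first glance, a small sketch like the following can pre-filter the summary; it assumes the table is tab-separated and that the column is literally named 'interesting' as described above, and it looks the column up by name so its position does not matter:

$ awk -F'\t' 'NR==1 { for (i=1; i<=NF; i++) if ($i=="interesting") c=i; print; next } c && $c=="yes"' output_info_featuresummary.txt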

4.6.5.  HTML files depicting SNP positions and deletions

I've come to realise that people who don't handle data from NextGen sequencing technologies on a regular basis (e.g., many biologists) don't want to be bothered with learning to handle specialised programs to have a look at their resequenced strains. Be it because they don't have time to learn how to use a new program or because their desktop machine is not powerful enough (CPU, memory) to handle the data sets.

Something even biologists know how to operate is a browser. Therefore, convert_project has the option to load a CAF file of a mapping assembly and output to HTML those areas which are interesting to biologists. It uses the tags SROc, SAOc, SIOc and MCVc and outputs the surrounding alignment of these areas together with a nice overview and links to jump from one position to the previous or next.

This is done with the '-t hsnp' option of convert_project:

$ convert_project -f caf -t hsnp input.caf output

Note: I recommend doing this only if the resequenced strain is a very close relative of the reference genome, else the HTML gets pretty big. But for a couple of hundred SNPs it works great.

4.6.6.  WIG files depicting contig coverage

convert_project can also dump a coverage file in WIG format (using '-t wig'). This comes in pretty handy when searching for genome deletions or duplications in programs like the Affymetrix Integrated Genome Browser (IGB, see http://igb.bioviz.org/).
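
Following the same pattern as the asnp and hsnp calls above (input file and output base name are placeholders), the call would look like this:

$ convert_project -f caf -t wig input.caf output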

4.7.  Walkthrough: mapping of E.coli from Lenski lab against E.coli B REL606

We're going to use data published by Richard Lenski in his great paper "Genome evolution and adaptation in a long-term experiment with Escherichia coli". This shows how MIRA finds all mutations between two strains and how one would need just a few minutes to know which genes are affected.

[Note]Note
All steps described in this walkthrough are present in ready-to-be-run scripts in the solexa3_lenski demo directory of the MIRA package.
[Note]Note
This walkthrough takes a few detours which are not really necessary, but show how things can be done: it reduces the number of reads, it creates a strain data file etc. Actually, the whole demo could be reduced to two steps: downloading the data (naming it correctly) and starting the assembly with a couple of parameters.

4.7.1.  Getting the data

We'll use the reference genome E.coli B REL606 to map one of the strains from the paper. For mapping, I picked strain REL8593A more or less at random. All the data needed is fortunately at the NCBI, let's go and grab it:

  1. the NCBI has REL606 named NC_012967. We'll use the RefSeq version and the GenBank formatted file you can download from ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_B_REL606/NC_012967.gbk

  2. the Solexa re-sequencing data you can get from ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX012/SRX012992/. Download both FASTQ files, SRR030257_1.fastq.gz and SRR030257_2.fastq.gz.

    If you want more info regarding these data sets, have a look at http://www.ncbi.nlm.nih.gov/sra/?db=sra&term=SRX012992&report=full

4.7.2.  Preparing the data for an assembly

In this section we will set up the directory structure for the assembly and pre-process the data so that MIRA can start right away.

Let's start with setting up a directory structure. Remember: you can set up the data almost any way you like, this is just how I do things.

I normally create a project directory with three sub-directories: origdata, data, and assemblies. In origdata I put the files exactly as I got them from the sequencing or data provider, without touching them, even removing write permissions from these files so that they cannot be tampered with. After that, I pre-process them and put the pre-processed files into data. Pre-processing can mean a lot of things: re-formatting the sequences, renaming them, perhaps also clipping them, etc. Finally, I use these pre-processed data in one or more assembly runs in the assemblies directory, perhaps trying out different assembly options.

arcadia:/some/path/$ mkdir lenskitest
arcadia:/some/path/$ cd lenskitest
arcadia:/some/path/lenskitest$ mkdir data origdata assemblies
arcadia:/some/path/lenskitest$ ls -l
drwxr-xr-x 2 bach bach 4096 2009-12-06 16:06 assemblies
drwxr-xr-x 2 bach bach 4096 2009-12-06 16:06 data
drwxr-xr-x 2 bach bach 4096 2009-12-06 16:06 origdata

Now copy the files you just downloaded into the directory origdata.

arcadia:/some/path/lenskitest$ cp /wherever/the/files/are/SRR030257_1.fastq.gz origdata
arcadia:/some/path/lenskitest$ cp /wherever/the/files/are/SRR030257_2.fastq.gz origdata
arcadia:/some/path/lenskitest$ cp /wherever/the/files/are/NC_012967.gbk origdata
arcadia:/some/path/lenskitest$ ls -l origdata
-rw-r--r-- 1 bach bach  10543139 2009-12-06 16:38 NC_012967.gbk
-rw-r--r-- 1 bach bach 158807975 2009-12-06 15:15 SRR030257_1.fastq.gz
-rw-r--r-- 1 bach bach 157595587 2009-12-06 15:21 SRR030257_2.fastq.gz
	

Great, let's preprocess the data. For this you must know a few things:

  • the standard Illumina naming scheme for Solexa paired-end reads is to append /1 to forward read names and /2 to reverse read names. The reads are normally put into at least two different files (one for forward, one for reverse). Now, the Solexa data stored in the Short Read Archive at the NCBI also has forward and reverse files for paired-end Solexas. That's OK. What's a bit less good is that the read names there DO NOT have /1 appended to the names of forward reads, or /2 to the names of reverse reads. The forward and reverse reads in both files are just named exactly the same. We'll need to fix that.

  • while Sanger and 454 reads should be preprocessed (clipping sequencing vectors, perhaps quality clipping etc.), Solexa reads do not need this. Some people perform quality clipping or clipping of reads with too many 'N's in the sequence, but this is not needed when using MIRA. In fact, MIRA will perform everything needed for Solexa reads itself and will generally do a much better job, as the clipping performed is independent of the Solexa quality values (which are not always the most trustworthy ones).

  • for a mapping assembly, it's good to give a strain name to the backbone and a strain name to the reads mapped against it. The former can be done via the command line, the latter is done for each read individually in a key-value file (the straindata file).

So, to pre-process the data, we will need to

  • put the reads of the NCBI forward and reverse pairs into one file

  • append /1 to the names of forward reads, and /2 to the names of reverse reads.

  • create a straindata file for MIRA

To ease things for you, I've prepared a small script which will do everything for you: copy and rename the reads as well as creating the strain names. Note that it's a small part of a more general script which I sometimes use to sample subsets of large data sets, but the Lenski data set is small enough that everything is taken.

Create a file prepdata.sh in the directory data and copy and paste the following into it:

######################################################################
#######
####### Prepare paired-end Solexa downloaded from NCBI
#######
######################################################################

# srrname:    is the SRR name as downloaded from the NCBI SRA
# numreads:   maximum number of forward (and reverse) reads to take from
#              each file. Just to avoid bacterial projects with a coverage
#              of 200 or so.
# strainname: name of the strain which was re-sequenced

srrname="SRR030257"
numreads=5000000
strainname="REL8593A"

################################

numlines=$((4*${numreads}))

# put "/1" Solexa reads into file
echo "Copying ${numreads} reads from _1 (forward reads)"
zcat ../origdata/${srrname}_1.fastq.gz | head -${numlines} | sed -e 's/SRR[0-9.]*/&\/1/' >${strainname}-${numreads}_in.solexa.fastq

# put "/2" Solexa reads into file
echo "Copying ${numreads} reads from _2 (reverse reads)"
zcat ../origdata/${srrname}_2.fastq.gz | head -${numlines} | sed -e 's/SRR[0-9.]*/&\/2/' >>${strainname}-${numreads}_in.solexa.fastq

# make file with strainnames
echo "Creating file with strain names for copied reads (this may take a while)."
grep "@SRR" ${strainname}-${numreads}_in.solexa.fastq | cut -f 1 -d ' ' | sed -e 's/@//' -e "s/$/ ${strainname}/" >>${strainname}-${numreads}_straindata_in.txt

Now, let's create the needed data:

arcadia:/some/path/lenskitest$ cd data
arcadia:/some/path/lenskitest/data$ ls -l
-rw-r--r-- 1 bach bach       1349 2009-12-06 17:05 prepdata.sh
arcadia:/some/path/lenskitest/data$ sh prepdata.sh
Copying 5000000 reads from _1 (forward reads)
Copying 5000000 reads from _2 (reverse reads)
Creating file with strain names for copied reads (this may take a while).
arcadia:/some/path/lenskitest/data$ ls -l
-rw-r--r-- 1 bach bach       1349 2009-12-06 17:05 prepdata.sh
-rw-r--r-- 1 bach bach 1553532192 2009-12-06 15:36 REL8593A-5000000_in.solexa.fastq
-rw-r--r-- 1 bach bach  218188232 2009-12-06 15:36 REL8593A-5000000_straindata_in.txt

Last step, just for the sake of completeness, link in the GenBank formatted file of the reference strain, giving it the same base name so that everything is nicely set up for MIRA.

arcadia:/some/path/lenskitest/data$ ln -s ../origdata/NC_012967.gbk REL8593A-5000000_backbone_in.gbf
arcadia:/some/path/lenskitest/data$ ls -l
-rw-r--r-- 1 bach bach       1349 2009-12-06 17:05 prepdata.sh
lrwxrwxrwx 1 bach bach         25 2009-12-06 16:39 REL8593A-5000000_backbone_in.gbf -> ../origdata/NC_012967.gbk
-rw-r--r-- 1 bach bach 1553532192 2009-12-06 15:36 REL8593A-5000000_in.solexa.fastq
-rw-r--r-- 1 bach bach  218188232 2009-12-06 15:36 REL8593A-5000000_straindata_in.txt
arcadia:/some/path/lenskitest/data$ cd ..
arcadia:/some/path/lenskitest$

Perfect, we're ready to start assemblies.

4.7.3.  Starting the mapping assembly

arcadia:/some/path/lenskitest$ cd assemblies
arcadia:/some/path/lenskitest/assemblies$ mkdir 1sttest
arcadia:/some/path/lenskitest/assemblies/1sttest$ lndir ../../data
arcadia:/some/path/lenskitest/assemblies/1sttest$ ls -l
lrwxrwxrwx 1 bach bach         22 2009-12-06 17:18 prepdata.sh -> ../../data/prepdata.sh
lrwxrwxrwx 1 bach bach         43 2009-12-06 16:40 REL8593A-5000000_backbone_in.gbf -> ../../data/REL8593A-5000000_backbone_in.gbf
lrwxrwxrwx 1 bach bach         43 2009-12-06 15:39 REL8593A-5000000_in.solexa.fastq -> ../../data/REL8593A-5000000_in.solexa.fastq
lrwxrwxrwx 1 bach bach         45 2009-12-06 15:39 REL8593A-5000000_straindata_in.txt -> ../../data/REL8593A-5000000_straindata_in.txt

Oooops, we don't need the link prepdata.sh here, just delete it.

arcadia:/some/path/lenskitest/assemblies/1sttest$ rm prepdata.sh

Perfect. Now then, start a simple mapping assembly:

arcadia:/some/path/lenskitest/assemblies/1sttest$ mira 
  --fastq 
  --project=REL8593A-5000000 
  --job=mapping,genome,accurate,solexa
  -SB:lsd=yes:bsn=ECO_B_REL606:bft=gbf
  >&log_assembly.txt
[Note]Note 1

The above command has been split across multiple lines for better overview but should be entered as one line. It basically says: load all data in FASTQ format; the project name is REL8593A-5000000 (and therefore all input and output files will have this prefix by default if not chosen otherwise); we want an accurate mapping of Solexa reads against a genome; load strain data from a separate strain file ( [-SB:lsd=yes]); the strain name of the reference sequence is 'ECO_B_REL606' ( [-SB:bsn=ECO_B_REL606]); and the file containing the reference sequence is in GenBank format ( [-SB:bft=gbf]). Last but not least, redirect the progress output of the assembler to a file named log_assembly.txt.

[Note]Note 2

The above assembly takes approximately 35 minutes on my computer (i7 940 with 12 GB RAM) when using 4 threads (I have '-GE:not=4' additionally). It may be faster or slower on your computer.

[Note]Note 3

You will need some 10.5 GB RAM to get through this. You might get away with a bit less RAM and using swap, but less than 8 GB RAM is not recommended.

Let's have a look at the directory now:

arcadia:/some/path/lenskitest/assemblies/1sttest$ ls -l
-rw-r--r-- 1 bach bach 1463331186 2010-01-27 20:41 log_assembly.txt
drwxr-xr-x 6 bach bach       4096 2010-01-27 20:04 REL8593A-5000000_assembly
lrwxrwxrwx 1 bach bach         43 2009-12-06 16:40 REL8593A-5000000_backbone_in.gbf -> ../../data/REL8593A-5000000_backbone_in.gbf
lrwxrwxrwx 1 bach bach         43 2009-12-06 15:39 REL8593A-5000000_in.solexa.fastq -> ../../data/REL8593A-5000000_in.solexa.fastq
lrwxrwxrwx 1 bach bach         45 2009-12-06 15:39 REL8593A-5000000_straindata_in.txt -> ../../data/REL8593A-5000000_straindata_in.txt

Not much has changed. All files created by MIRA will be in the REL8593A-5000000_assembly directory. Going one level down, you'll see 4 sub-directories:

arcadia:/some/path/lenskitest/assemblies/1sttest$ cd REL8593A-5000000_assembly
arcadia:.../1sttest/REL8593A-5000000_assembly$ ls -l
drwxr-xr-x 2 bach bach 4096 2010-01-27 20:29 REL8593A-5000000_d_chkpt
drwxr-xr-x 2 bach bach 4096 2010-01-27 20:40 REL8593A-5000000_d_info
drwxr-xr-x 2 bach bach 4096 2010-01-27 20:30 REL8593A-5000000_d_tmp
drwxr-xr-x 2 bach bach 4096 2010-01-27 21:19 REL8593A-5000000_d_results

You can safely delete the tmp and the chkpt directories; in this walkthrough they are not needed anymore.
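
Here, that boils down to (directory names taken from the listing above):

arcadia:.../1sttest/REL8593A-5000000_assembly$ rm -rf REL8593A-5000000_d_tmp REL8593A-5000000_d_chkpt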

4.7.4.  Looking at results

Results will be in a sub-directory created by MIRA. Let's go there and have a look.

arcadia:/some/path/lenskitest/assemblies/1sttest$ cd REL8593A-5000000_assembly
arcadia:.../1sttest/REL8593A-5000000_assembly$ cd REL8593A-5000000_d_results
arcadia:.../REL8593A-5000000_d_results$ ls -l
-rw-r--r-- 1 bach bach  455087340 2010-01-27 20:40 REL8593A-5000000_out.ace
-rw-r--r-- 1 bach bach  972479972 2010-01-27 20:38 REL8593A-5000000_out.caf
-rw-r--r-- 1 bach bach  569619434 2010-01-27 20:38 REL8593A-5000000_out.maf
-rw-r--r-- 1 bach bach    4708371 2010-01-27 20:39 REL8593A-5000000_out.padded.fasta
-rw-r--r-- 1 bach bach   14125036 2010-01-27 20:39 REL8593A-5000000_out.padded.fasta.qual
-rw-r--r-- 1 bach bach  472618709 2010-01-27 20:39 REL8593A-5000000_out.tcs
-rw-r--r-- 1 bach bach    4707025 2010-01-27 20:39 REL8593A-5000000_out.unpadded.fasta
-rw-r--r-- 1 bach bach   14120999 2010-01-27 20:39 REL8593A-5000000_out.unpadded.fasta.qual
-rw-r--r-- 1 bach bach   13862715 2010-01-27 20:39 REL8593A-5000000_out.wig

You can see that MIRA has created output in many different formats suited for a number of different applications. Most commonly known will be ACE and CAF for their use in finishing programs (e.g. gap4 and consed).

In a different directory (the info directory) there are also files containing all sorts of statistics and useful information.

arcadia:.../REL8593A-5000000_d_results$ cd ../REL8593A-5000000_d_info/
arcadia:.../REL8593A-5000000_d_info$ ls -l
-rw-r--r-- 1 bach bach     2256 2010-01-27 20:40 REL8593A-5000000_info_assembly.txt
-rw-r--r-- 1 bach bach      124 2010-01-27 20:04 REL8593A-5000000_info_callparameters.txt
-rw-r--r-- 1 bach bach    37513 2010-01-27 20:37 REL8593A-5000000_info_consensustaglist.txt
-rw-r--r-- 1 bach bach 28522692 2010-01-27 20:37 REL8593A-5000000_info_contigreadlist.txt
-rw-r--r-- 1 bach bach      176 2010-01-27 20:37 REL8593A-5000000_info_contigstats.txt
-rw-r--r-- 1 bach bach 15359354 2010-01-27 20:40 REL8593A-5000000_info_debrislist.txt
-rw-r--r-- 1 bach bach 45802751 2010-01-27 20:37 REL8593A-5000000_info_readtaglist.txt

Just have a look at them to get a feeling for what they show. You'll find more information regarding these files in the main manual of MIRA. At the moment, let's just make a quick assessment of the differences between the Lenski reference strain and the REL8593A strain by counting how many SNPs MIRA thinks there are (marked with SROc tags in the consensus):

arcadia:.../REL8593A-5000000_d_info$ grep -c SROc REL8593A-5000000_info_consensustaglist.txt
102

102 bases are marked with such a tag. You will later see that this is an overestimation due to several insertion sites and deletions, but it's a good first approximation.

Let's count how many potential deletion sites REL8593A has in comparison to the reference strain:

arcadia:.../REL8593A-5000000_d_info$ grep -c MCVc REL8593A-5000000_info_consensustaglist.txt
48

This number too is a slight overestimation, due to cross-contamination with a sequenced strain which did not have these deletions, but it's also a good first approximation.

4.7.5.  Post-processing with gap4 and re-exporting to MIRA

To have a look at your project in gap4, use the caf2gap program (you can get it at the Sanger Centre), and then gap4:

arcadia:.../REL8593A-5000000_d_results$ ls -l
-rw-r--r-- 1 bach bach  455087340 2010-01-27 20:40 REL8593A-5000000_out.ace
-rw-r--r-- 1 bach bach  972479972 2010-01-27 20:38 REL8593A-5000000_out.caf
-rw-r--r-- 1 bach bach  569619434 2010-01-27 20:38 REL8593A-5000000_out.maf
-rw-r--r-- 1 bach bach    4708371 2010-01-27 20:39 REL8593A-5000000_out.padded.fasta
-rw-r--r-- 1 bach bach   14125036 2010-01-27 20:39 REL8593A-5000000_out.padded.fasta.qual
-rw-r--r-- 1 bach bach  472618709 2010-01-27 20:39 REL8593A-5000000_out.tcs
-rw-r--r-- 1 bach bach    4707025 2010-01-27 20:39 REL8593A-5000000_out.unpadded.fasta
-rw-r--r-- 1 bach bach   14120999 2010-01-27 20:39 REL8593A-5000000_out.unpadded.fasta.qual
-rw-r--r-- 1 bach bach   13862715 2010-01-27 20:39 REL8593A-5000000_out.wig
arcadia:.../REL8593A-5000000_d_results$ caf2gap -project REL8593A -ace REL8593A-5000000_out.caf >&/dev/null
arcadia:.../REL8593A-5000000_d_results$ ls -l
-rw-r--r-- 1 bach bach 1233494048 2010-01-27 20:43 REL8593A.0
-rw-r--r-- 1 bach bach  233589448 2010-01-27 20:43 REL8593A.0.aux
-rw-r--r-- 1 bach bach  455087340 2010-01-27 20:40 REL8593A-5000000_out.ace
-rw-r--r-- 1 bach bach  972479972 2010-01-27 20:38 REL8593A-5000000_out.caf
-rw-r--r-- 1 bach bach  569619434 2010-01-27 20:38 REL8593A-5000000_out.maf
-rw-r--r-- 1 bach bach    4708371 2010-01-27 20:39 REL8593A-5000000_out.padded.fasta
-rw-r--r-- 1 bach bach   14125036 2010-01-27 20:39 REL8593A-5000000_out.padded.fasta.qual
-rw-r--r-- 1 bach bach  472618709 2010-01-27 20:39 REL8593A-5000000_out.tcs
-rw-r--r-- 1 bach bach    4707025 2010-01-27 20:39 REL8593A-5000000_out.unpadded.fasta
-rw-r--r-- 1 bach bach   14120999 2010-01-27 20:39 REL8593A-5000000_out.unpadded.fasta.qual
-rw-r--r-- 1 bach bach   13862715 2010-01-27 20:39 REL8593A-5000000_out.wig

arcadia:.../REL8593A-5000000_d_results$ gap4 REL8593A.0

Search for the tags set by MIRA which denote features or problems (SROc, WRMc, MCVc, UNSc, IUPc; see the main manual for the full list) in the assembly, and edit accordingly. Save your gap4 database as a new version (e.g. REL8593A.1), then exit gap4.

Then use the gap2caf command (also from the Sanger Centre) to convert the gap4 database back to CAF.

arcadia:.../REL8593A-5000000_d_results$ gap2caf -project REL8593A.1 >rel8593a_edited.caf

As gap4 will have jumbled the consensus (it does not know about different sequencing technologies), having convert_project recalculate the consensus (with the "-r c" option) is generally a good idea.

arcadia:.../REL8593A-5000000_d_results$ convert_project -f caf -t caf -r c rel8593a_edited.caf rel8593a_edited_recalled

4.7.6.  Converting mapping results into HTML and simple spreadsheet tables for biologists

You will have to use either CAF or MAF as input, either of which can be the direct result from the MIRA assembly or an already cleaned and edited file. For the sake of simplicity, we'll use the file created by MIRA in the steps above.

Let's start with a HTML file showing all positions of interest:

arcadia:.../REL8593A-5000000_d_results$ convert_project -f caf -t hsnp REL8593A-5000000_out.caf rel8593a
arcadia:.../REL8593A-5000000_d_results$ ls -l *html
-rw-r--r-- 1 bach bach 5198791 2010-01-27 20:49 rel8593a_info_snpenvironment.html

But MIRA can do even better: create tables ready to be imported into spreadsheet programs.

arcadia:.../REL8593A-5000000_d_results$ convert_project -f caf -t asnp REL8593A-5000000_out.caf rel8593a
arcadia:.../REL8593A-5000000_d_results$ ls -l rel8593a*
-rw-r--r-- 1 bach bach      25864 2010-01-27 20:48 rel8593a_info_featureanalysis.txt
-rw-r--r-- 1 bach bach   12402905 2010-01-27 20:48 rel8593a_info_featuresequences.txt
-rw-r--r-- 1 bach bach     954473 2010-01-27 20:48 rel8593a_info_featuresummary.txt
-rw-r--r-- 1 bach bach    5198791 2010-01-27 20:49 rel8593a_info_snpenvironment.html
-rw-r--r-- 1 bach bach      13810 2010-01-27 20:47 rel8593a_info_snplist.txt

Have a look at all the files, perhaps starting with the SNP list, then the feature analysis, then the feature summary (your biologists will love that one, especially when combined with filters in the spreadsheet program) and then the feature sequences.

5.  De-novo Solexa only assemblies

This is actually quite straightforward if you name your reads according to the MIRA standard for input files. Assume you have the following file (bchocse being an example mnemonic for your project):

arcadia:/path/to/myProject$ ls -l
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocse_in.solexa.fastq

5.1.  Without paired-end

Here's the simplest way to start the assembly:

arcadia:/path/to/myProject$ mira 
  --project=bchocse
  --job=denovo,genome,accurate,solexa 
  >&log_assembly.txt

Of course, you can add any other switch you want, e.g. changing the number of processors used, adding default strain names etc.pp

5.2.  With paired-end (only one library size)

If you have only one library with one insert size, you just need to tell MIRA the minimum and maximum distance the reads should be away from each other. In the following example I have a library size of 500 bp and have set the minimum and maximum distance to +/- 50% (you might want to use other modifiers):

arcadia:/path/to/myProject$ mira 
  --project=bchocse 
  --job=denovo,genome,accurate,solexa 
  SOLEXA_SETTINGS -GE:tismin=250:tismax=750
  >&log_assembly.txt
[Note]Note

For this example to work, make sure that the read pairs are named using the Solexa standard, i.e., having /1 for one read and /2 for the other read. If yours have a different naming scheme, look up the -LR:rns parameter in the main documentation.

5.3.  With paired-end (several library sizes)

To tell MIRA exactly which reads have which insert size, one must use an XML file containing ancillary data in NCBI TRACEINFO format. In case you don't have such a file, here's a very simple example containing only insert sizes for reads (lane 1 has a library size of 500 bases and lane 2 a library size of 2 Kb):

<?xml version="1.0"?>
<trace_volume>
<trace>
<trace_name>1_17_510_1281/1</trace_name>
<insert_size>500</insert_size>
<insert_stdev>100</insert_stdev>
</trace>
<trace>
<trace_name>1_17_510_1281/2</trace_name>
<insert_size>500</insert_size>
<insert_stdev>100</insert_stdev>
</trace>
...
<trace>
<trace_name>2_17_857_850/1</trace_name>
<insert_size>2000</insert_size>
<insert_stdev>300</insert_stdev>
</trace>
<trace>
<trace_name>2_17_857_850/2</trace_name>
<insert_size>2000</insert_size>
<insert_stdev>300</insert_stdev>
</trace>
...
</trace_volume>

So, if your directory looks like this:

arcadia:/path/to/myProject$ ls -l
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocse_in.solexa.fastq
-rw-r--r-- 1 bach users 324987513 2008-04-01 13:24 bchocse_traceinfo_in.solexa.xml
      

then starting the assembly is done like this (note the additional [-LR:mxti] parameter in the section for Solexa settings):

arcadia:/path/to/myProject$ mira
  --project=bchocse 
  --job=denovo,genome,accurate,solexa 
  SOLEXA_SETTINGS -LR:mxti=yes
  >&log_assembly.txt

6.  De-novo hybrid assemblies (Solexa + ...)

Two strategies can be thought of to assemble genomes using a combination of Solexa and other (longer) reads: either using all reads for a full de-novo assembly, or first assembling the longer reads and then using the resulting assembly as backbone to map the Solexa reads. Both strategies have their pros and cons.

6.1.  All reads de-novo

Throwing all reads into a de-novo assembly is the most straightforward way to get 'good' assemblies. This strategy is also the one which - in most cases - yields the longest contigs as, in many projects, parts of a genome not covered by one sequencing technology will probably be covered by another sequencing technology. Furthermore, having the consensus covered by more than one sequencing technology makes base calling pretty robust: if MIRA finds disagreements it cannot resolve easily, the assembler at least leaves a tag in the assembly to point human finishers to these positions of interest.

The downside of this approach, however, is the fact that the sheer amount of data in Solexa sequencing projects makes life difficult for de-novo assemblers, especially for MIRA, which keeps quite a lot of additional information in memory in de-novo assemblies and tries to use algorithms that are as exact as possible during contig construction. Therefore, MIRA sometimes still runs into data sets which make it behave quite badly with respect to assembly time and memory consumption (but this is being constantly improved).

Full de-novo hybrid assemblies can be recommended only for bacteria at the moment, although lower eukaryotes should also be feasible on larger machines.

6.1.1.  Starting the assembly

Starting the assembly is now just a matter of a simple command line with some parameters set correctly. The following is a de-novo hybrid assembly with 454 and Solexa reads.

arcadia:/path/to/myProject$ mira 
  --project=bchocse --job=denovo,genome,normal,454,solexa
  >&log_assembly.txt

6.2.  Long reads first, then Solexa

This strategy works in two steps: first assembling long reads, then mapping short reads to the full alignment (not just a consensus sequence). The result will be an assembly containing 454 (or Sanger) and Solexa reads.

6.2.1.  Step 1: assemble the 'long' reads (454 or Sanger or both)

Assemble your data just as you would when assembling 454 or Sanger data.

6.2.2.  Step 2: filter the results

This step fetches the 'long' contigs from the previous assembly. The idea is to get all contigs larger than 500 bases.

$ convert_project -f caf -t caf -x 500 assemblyresult.caf hybrid_backbone_in.caf

You might also want to add an additional filter for minimum average coverage. If your project has an average coverage of 24, you could filter for a minimum average coverage of a third of that (coverage 8; you might want to try out higher coverages) like this:

$ convert_project -f caf -t caf -x 500 -y 8 assemblyresult.caf hybrid_backbone_in.caf

6.2.3.  Step 3: map the Solexa data

Copy the hybrid backbone to a new, empty directory, add in the Solexa data, and start a mapping assembly using the CAF as input for the backbone. If you assembled the 454 / Sanger data with strain info, the Solexa data should also get strain info (as described above).

arcadia:/path/to/myProject$ ls -l
-rw-r--r-- 1 bach bach 1159280980 2009-10-31 19:46 hybrid_backbone_in.caf
-rw-r--r-- 1 bach bach  338430282 2009-10-31 20:31 hybrid_in.solexa.fastq
arcadia:/path/to/myProject$ mira 
  --project=hybrid --job=mapping,genome,accurate,solexa
  -AS:nop=1
  -SB:bft=caf
  >&log_assembly.txt

7.  Post-processing of assemblies

This section is a bit terse; you should also read the chapter on working with results of MIRA3.

7.1.  Post-processing mapping assemblies

When working with resequencing data and a mapping assembly, I always load finished projects into an assembly editor and perform a quick cleanup of the results.

For close relatives of the reference strain this doesn't take long, as MIRA will have set tags (see the section earlier in this document) at all sites you should have a look at. For example, very close mutant bacteria with just SNPs or simple deletions and no genome reorganisation I usually clean up in 10 to 15 minutes. That gives the last boost to data quality, and your users (biologists etc.) will thank you for it as it reduces their work in analysing the data (be it looking at data or performing wet-lab experiments).

Assume you have the following result files in the result directory of a MIRA assembly:

arcadia:/path/to/myProject/newstrain_d_results$ ls -l
-rw-r--r-- 1 bach bach 312607561 2009-06-08 14:57 newstrain_out.ace
-rw-r--r-- 1 bach bach 655176303 2009-06-08 14:56 newstrain_out.caf
...

The general workflow I use is to convert the CAF file to a gap4 database and start the gap4 editor:

arcadia:newstrain_d_results$ caf2gap -project NEWSTRAIN -ace newstrain_out.caf >& /dev/null
arcadia:newstrain_d_results$ gap4 NEWSTRAIN.0

Then, in gap4, I

  1. quickly search for the UNSc and WRMc tags and check whether they could be real SNPs that were overlooked by MIRA. In that case, I manually set an SROc (or SIOc) tag in gap4 via hotkeys that were defined to set these tags.

  2. sometimes also quickly clean up reads that are causing trouble in alignments and lead to wrong base calls. These can be found at sites with UNSc tags; most of the time they contain the 5' to 3' GGCxG motif which can cause trouble for Solexa.

  3. look at sites with deletions (tagged with MCVc) and check whether I should clean up the borders of the deletion.

After this, I convert the gap4 database back to CAF format:

$ gap2caf -project NEWSTRAIN >newstrain_edited.caf

But beware: gap4 does not have the same consensus calling routines as MIRA and will have saved its own consensus in the new CAF. In fact, gap4 performs rather badly in projects with multiple sequencing technologies. So I use convert_project from the MIRA package to recall a good consensus (and save it in MAF, as it's more compact and a lot faster to handle than CAF):

$ convert_project -f caf -t maf -r c newstrain_edited.caf newstrain_edited_recalled

And from this file I can then convert with convert_project to any other format I or my users need: CAF, FASTA, ACE, WIG (for coverage analysis) etc.pp.

I can also generate tables and HTML files with SNP analysis results (with the "-t asnp" and "-t hsnp" options of convert_project).
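
Sticking to the example above, and assuming the recalled file was written as newstrain_edited_recalled.maf, the two calls would simply be:

$ convert_project -f maf -t asnp newstrain_edited_recalled.maf newstrain_edited
$ convert_project -f maf -t hsnp newstrain_edited_recalled.maf newstrain_edited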

7.2.  Post-processing de-novo assemblies

As the result file of MIRA de-novo assemblies contains everything down to 'contigs' with just two reads, it is advised to first filter out all contigs which are smaller than a given size or have a coverage lower than 1/3 to 1/2 of the overall coverage.

Filtering is performed by convert_project, using a CAF file as input. Assume you have the following file:

arcadia:/path/to/myProject/newstrain_d_results$ ls -l
...
-rw-r--r-- 1 bach bach 655176303 2009-06-08 14:56 newstrain_out.caf
...

Let's say you have a hybrid assembly with an average coverage of 50x. I normally filter out all contigs which have an average coverage of less than 1/3 of that and are smaller than 500 bases. These are mostly junk contiglets remaining from the assembly and can be more or less safely ignored. This is done the following way:

arcadia:newstrain_d_results$ convert_project
  -f caf -t caf -x 500 -y 17 newstrain_out.caf newstrain_filterx500y17

From there on, convert the filtered CAF file to anything you need to continue finishing the genome (gap4 database, ACE, etc.pp).
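
For example, assuming the filtered file was written as newstrain_filterx500y17.caf, the ACE and gap4 routes shown earlier would be (the gap4 project name is arbitrary):

arcadia:newstrain_d_results$ convert_project -f caf -t ace newstrain_filterx500y17.caf newstrain_filterx500y17
arcadia:newstrain_d_results$ caf2gap -project NEWSTRAIN_FILTERED -ace newstrain_filterx500y17.caf >& /dev/null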

8.  Known bugs / problems

These apply to version 3 of MIRA and might or might not have been addressed in later versions.

Bugs:

  1. mapping of paired-end reads with one read in a non-repetitive area and the other in a repeat is not as effective as it should be. The optimal strategy would be to first map the non-repetitive read and then the read in the repeat. Unfortunately, this is not yet implemented in MIRA.

Problems:

  1. the textual output of results is really slow with the massive amounts of data in Solexa projects. If Solexa data is present, it is therefore turned off by default at the moment.