"There is no such thing like overkill." --Solomon Short
Notes of caution:
this guide is still not finished (and may contain outdated information regarding read lengths in parts), but it should cover most basic use cases.
you need lots of memory ... roughly 1 to 1.5 GiB per million Solexa reads. Using MIRA for anything more than 50 to 100 million Solexa reads is probably not a good idea.
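As a back-of-the-envelope sketch (the only input is the 1 to 1.5 GiB per million reads rule of thumb above; the project size is a made-up example), the expected memory for a project can be estimated like this:

```shell
# Rough RAM estimate for a Solexa project in MIRA, using the
# 1 to 1.5 GiB per million reads rule of thumb quoted above.
reads_millions=20                     # hypothetical project size
low=$(( reads_millions * 1 ))         # lower bound in GiB
high=$(( reads_millions * 3 / 2 ))    # upper bound in GiB
echo "~${low} to ${high} GiB RAM for ${reads_millions} million reads"
```

For 20 million reads this lands at roughly 20 to 30 GiB, which matches the yeast example further down in this guide.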
This guide assumes that you have basic working knowledge of Unix systems, know the basic principles of sequencing (and sequence assembly) and what assemblers do.
While there are step by step instructions on how to setup your Solexa data and then perform an assembly, this guide expects you to read at some point in time
the MIRA reference manual file, to look up some command line options as well as general information on what tags MIRA uses in assemblies, the files it generates etc.
the short usage introduction to MIRA3 so that you have a basic knowledge on how to set up projects in mira for Sanger sequencing projects.
Even very short Solexa reads (< 50bp) are great for mapping assemblies. I simply love them as you can easily spot differences in mutant organisms ... or boost the quality of a newly sequenced genome to Q60.
Regarding de-novo assemblies ... well, from an assembler's point of view, very short reads are a catastrophe, regardless of the sequencing technology.
Repeats. The problem of repetitive sequences (e.g. rRNA stretches in bacteria) gets worse the shorter the read lengths get.
Amount of data. As mira is at heart an assembler made to resolve difficult repetitive problems as they occur in Sanger and 454 reads, it drags along quite a lot of ancillary information which is useless in Solexa assemblies ... but still eats away memory.
Things look better for the now available 'longer' Solexa reads. Starting with a length of 75bp and paired-end data, de-novo assembly for bacteria is not that bad at all. The first Solexas with a length of ~110 bases are appearing in public, and from a contig-building perspective these are about as good for de-novo as the first 454 GS20 reads were.
Here's the rule of thumb I use: the longer, the better. If you have to pay a bit more to get longer reads (e.g. Solexa 100mers instead of 75mers), go get the longer reads. With these, the results you generate are way(!) better than with 36, 50 or even 75mers ... both in mapping and de-novo. Don't try to save a couple of hundred bucks in sequencing, you'll pay dearly afterwards in assembly.
Note: This section contains things I've seen in the past and simply jotted down. You may have different observations.
For 36mers and the MIRA proposed-end-clipping, even with the old pipeline I get about 90 to 95% of reads matching a reference without a single error. For 72mers, the number is approximately 5% lower, for 100mers another 5% less. Still, these are great numbers.
The new base calling pipeline (1.4 or 2.4?) rolled out by Illumina in Q1/Q2 2009 typically yields 20-50% more data from the very same images. Furthermore, the base calling is way better than in the old pipeline. For Solexa 76mers, after trimming I get only 1% real junk, and between 85 and 90% of the reads match a reference without a single error. Of the remaining reads, roughly 50% have one error, 25% have two errors, 12.5% have three errors etc.
It is worthwhile to re-analyse your old data if the images are still around.
Long homopolymers (stretches of identical bases in reads) can be a slight problem for Solexa. However, it must be noted that this is a problem of all sequencing technologies on the market so far (Sanger, Solexa, 454). Furthermore, the problem is much less pronounced in Solexa than in 454 data: in Solexa, the first problems may appear in stretches of 9 to 10 bases, while in 454 a stretch of 3 to 4 bases may already start being problematic in some reads.
The GGCxG or even GGC motif in the 5' to 3' direction of reads. This one is particularly annoying, and it took me quite a while to circumvent the problems it causes in MIRA. Simply put: at some places in a genome, base calling after a GGCxG or GGC motif is particularly error prone, and the number of reads without errors declines markedly. Repeated GGC motifs worsen the situation. The following screenshots of a mapping assembly illustrate this.
The first example is a GGCxG motif (in the form of a GGCTG) occurring in approximately one third of the reads at the shown position. Note that all but one read with this problem are in the same (plus) direction. The next two screenshots show the GGC motif, once for forward direction reads and once for reverse direction reads:
Places in the genome that have GGCGGC.....GCCGCC (a motif, perhaps even repeated, then some bases, and then an inverted motif) almost always have a very, very low number of good reads, especially when the motif is GGCxG. Things get especially difficult when these motifs occur at sites where users have a genuine interest. The following example is a screenshot from the Lenski data (see walk-through below) where a simple mapping reveals an anomaly which -- in reality -- is an IS insertion (see http://www.nature.com/nature/journal/v461/n7268/fig_tab/nature08480_F1.html) but could also look like a GGCxG motif in forward direction (GGCCG) and at the same time a GGC motif in reverse direction:
Here I'm recycling a few slides from a couple of talks I held in 2010.
Things used to be so nice and easy with the early Solexa data I worked with (36 and 44mers) in late 2007 / early 2008. When sample taking was done right -- e.g. for bacteria: in stationary phase -- and the sequencing lab did a good job, the read coverage of the genome was almost even. I did see a few papers claiming to see non-trivial GC bias back then, but after having analysed the data I worked with, I dismissed them as "not relevant for my use cases." Have a look at the following figure showing, as an example, the coverage of a 45% GC bacterium in 2008:
Figure 5. Example for no GC coverage bias in 2008 Solexa data. Apart from a slight smile shape of the coverage -- indicating the sample taking was not 100% in the stationary phase of the bacterial culture -- everything looks pretty nice: the average coverage is at 27x, and when looking for potential genome duplications at twice the coverage (54x), there's nothing apart from a single peak (which turned out to be a problem in an rRNA region).
Things changed starting sometime in Q3 2009; at least that's when I got some data which made me notice a problem. Have a look at the following figure, which shows exactly the same organism as in the figure above (bacterium, 45% GC):
Figure 6. Example for GC coverage bias starting Q3 2009 in Solexa data. There's no smile shape anymore -- the people in the lab learned to pay attention to sampling at 100% stationary phase -- but something else is extremely disconcerting: the average coverage is at 33x, and when looking for potential genome duplications at twice the coverage (66x), there are several dozen peaks crossing the 66x threshold over several kilobases (in one case over 200 kb) all over the genome. As if several small genome duplications had happened.
By the way, the figures above are just examples: I saw over a dozen sequencing projects in 2008 without GC bias and several dozen in 2009 / 2010 with GC bias.
Checking the potential genome duplication sites, they all looked "clean", i.e., the typical genome insertion markers are missing. Poking around at possible explanations, I looked at GC content of those parts in the genome ... and there was the explanation:
Figure 7. Example for GC coverage bias, direct comparison of 2008 / 2010 data. The bug has 45% average GC; areas with above-average read coverage in the 2010 data turn out to have lower GC: around 33 to 36%. The effect is also noticeable in the 2008 data, but barely so.
Why the GC bias suddenly became so strong is unknown to me. The people in the lab have used the same protocol for several years to extract the DNA, and the sequencing providers claim to always use the standard Illumina protocols.
But obviously something must have changed; several possible reasons have been floated.
It took Illumina some 18 months to resolve the problem for the broader public: since the data I work on has been prepared with the TruSeq kit, this problem has vanished.
However, if you based conclusions on or wrote a paper with Illumina data which might be affected by the GC bias (Q3 2009 to Q4 2010), I suggest you rethink all the conclusions drawn. This is especially the case for transcriptomics experiments, where a difference in expression of 2x to 3x starts to be highly significant!
This part will walk you step by step through getting your data together for a simple mapping assembly.
I'll make up an example using an imaginary bacterium: Bacillus chocorafoliensis (or short: Bchoc).
In this example, we assume you have two strains: a wild type strain of Bchoc_wt and a mutant which you perhaps got from mutagenesis or other means. Let's imagine that this mutant needs more time to eliminate a given amount of chocolate, so we call the mutant Bchoc_se ... SE for slow eater.
You want to know which mutations might be responsible for the observed behaviour. Assume the genome of Bchoc_wt is available to you as it was published (or you previously sequenced it), so you resequenced Bchoc_se with Solexa to examine the mutations.
You need to create (or get from your sequencing provider) the sequencing data in either FASTQ or FASTA + FASTA quality format. The following walkthrough uses what most people nowadays get: FASTQ.
Put the FASTQ data into an empty directory and rename the file so that it looks like this:
arcadia:/path/to/myProject$
ls -l
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocse_in.solexa.fastq
The reference sequence (the backbone) can be in a number of different formats: FASTA, GenBank, CAF. The latter two have the advantage of being able to carry additional information like, e.g., annotation. In this example, we will use a GenBank file like the ones one can download from the NCBI. So, let's assume that our wild type strain is in the following file: NC_someNCBInumber.gbk. Copy this file to the directory (you may also set a link), renaming it as bchocse_backbone_in.gbf.
arcadia:/path/to/myProject$
cp /somewhere/NC_someNCBInumber.gbk bchocse_backbone_in.gbf
arcadia:/path/to/myProject$
ls -l
-rw-r--r-- 1 bach users 6543511 2008-04-08 23:53 bchocse_backbone_in.gbf
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocse_in.solexa.fastq
Starting the assembly is now just a matter of a simple command line with some parameters set correctly. The following is an example of what I use when mapping onto a reference sequence in GenBank format:
arcadia:/path/to/myProject$
mira --project=bchocse --job=mapping,genome,accurate,solexa -AS:nop=1 -SB:bsn=bchoc_wt:bft=gbf:bbq=30 SOLEXA_SETTINGS -SB:ads=yes:dsn=bchocse >&log_assembly.txt
Note 1: The above command has been split in multiple lines for better overview but should be entered in one line.
Note 2: Please look up the parameters used in the main manual. The ones above basically say: make an accurate mapping of Solexa reads against a genome; in one pass; the name of the backbone strain is 'bchoc_wt'; the file type containing the backbone is a GenBank file; the base qualities for the backbone are to be assumed Q30; for Solexa data: assign a default strain name to reads which have no ancillary strain information loaded, and that default strain name should be 'bchocse'.
Note 3: For a bacterial project having a backbone of ~4 megabases and ~4.5 million Solexa 36mers, MIRA needs some ~21 minutes on my development machine. A yeast project with a genome of ~20 megabases and ~20 million 36mers needs 3.5 hours and 28 GiB RAM.
For this example - if you followed the walk-through on how to prepare the data - the only options you might want to adapt at first are the following:
-project (for naming your assembly project)
-SB:bsn to give the backbone strain (your reference strain) another name
-SB:bft to load the backbone sequence from another file type, say, a FASTA
-SB:dsn to give the Solexa reads another strain name
Of course, you are free to change any option via the extended parameters, but this will be the topic of another FAQ.
MIRA will make use of ancillary information when present. The strain name is one such piece of ancillary information. That is, we can tell MIRA the strain of each read we use in the assembly. In the example above, this information was given on the command line, as all the reads to be mapped had the same strain information. But what to do if one wants to map reads from several strains?
We could generate a TRACEINFO XML file with all bells and whistles, but for strain data there's an easier way: the straindata file. It's a simple key-value file, one line per entry, with the name of the read as key (first entry in the line) and, separated by a blank, the name of the strain as value (second entry in the line). E.g.:
1_1_207_113 strain1
1_1_61_711 strain1
1_1_182_374 strain2
...
2_1_13_654 strain2
...
Etcetera. You will obviously replace 'strain1' and 'strain2' with your strain names.
This file can be generated quickly and automatically by extracting the read names from the FASTQ file and rewriting them a little. Here's how:
arcadia:/path/to/myProject$
ls -l
-rw-r--r-- 1 bach users 494282343 2008-03-28 22:11 bchocse_in.solexa.fastq
arcadia:/path/to/myProject$
grep "^@" bchocse_in.solexa.fastq | sed -e 's/@//' | cut -f 1 | cut -f 1 -d ' ' | sed -e 's/$/ bchocse/' > bchocse_straindata_in.txt
arcadia:/path/to/myProject$
ls -l
-rw-r--r-- 1 bach users 494282343 2008-03-28 22:11 bchocse_in.solexa.fastq
-rw-r--r-- 1 bach users 134822451 2008-03-28 22:13 bchocse_straindata_in.txt
Note 1: The above command has been split in multiple lines for better overview but should be entered in one line.
Note 2: For larger files, this can run a minute or two.
Note 3: As you can also assemble sequences from more than one strain, the read names in
This creates the needed data in the file bchocse_straindata_in.txt (well, it's one way to do it, feel free to use whatever suits you best).
When using paired-end data, you must decide whether you want to:
use the MIRA feature to create long 'coverage equivalent reads' (CERs), which saves a lot of memory (both in the assembler and later on in an assembly editor) -- however, you then lose the paired-end information!
or keep the paired-end information, at the expense of larger memory requirements both in MIRA and in assembly editors afterwards.
The Illumina pipeline generally gives you two files for paired-end data: a project-1.fastq and a project-2.fastq. The first file contains the first read of each read-pair, the second file the second read. Depending on the preprocessing pipeline of your sequencing provider, the read names can either be exactly the same in both files or already have a /1 or /2 appended.
Note: For running MIRA, you must concatenate all sequence input files into one file.
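For illustration, a minimal sketch of the concatenation step on stand-in files (the file contents and read names here are made up; your real project-1.fastq and project-2.fastq come from your sequencing provider):

```shell
# Stand-ins for the two paired-end files from the Illumina pipeline:
printf '@read1/1\nACGT\n+\nIIII\n' > project-1.fastq
printf '@read1/2\nTTGC\n+\nIIII\n' > project-2.fastq
# MIRA expects all sequence input in a single file:
cat project-1.fastq project-2.fastq > project_in.solexa.fastq
wc -l < project_in.solexa.fastq    # 8 lines = 2 reads * 4 lines each
```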
If the read names do not follow the /1 and /2 scheme, you must obviously rename them in the process. A little sed command can do this automatically for you. Assuming your reads all have the prefix SRR_something_, the following line appends /1 to all lines which begin with @SRR_something_:
arcadia:/path/to/myProject$
sed -e '/^@SRR_something_/s/$/\/1/' input.fastq >output.fastq
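Run on a tiny stand-in file (the read name is hypothetical), one such rename looks like this; the sed expression here appends /1 at the end of every header line that starts with the prefix:

```shell
# Stand-in FASTQ with one read (made-up name; real names differ):
printf '@SRR_something_0001\nACGT\n+\nIIII\n' > input.fastq
# Append "/1" to the end of each matching header line:
sed -e '/^@SRR_something_/s/$/\/1/' input.fastq > output.fastq
head -n 1 output.fastq
```

Sequence and quality lines are left untouched because they do not match the address pattern.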
If you don't care about the paired-end information, you can start the mapping assembly exactly like an assembly for data without paired-end info (see section above).
In case you want to keep the paired-end information, here's the command line (again an example when mapping against a GenBank reference file, assuming that the library insert size is ~500 bases):
arcadia:/path/to/myProject$
mira --project=bchocse --job=mapping,genome,accurate,solexa -AS:nop=1 -SB:lsd=yes:bsn=bchoc_wt:bft=gbf:bbq=30 SOLEXA_SETTINGS -CO:msr=no -GE:uti=no:tismin=250:tismax=750 -SB:ads=yes:dsn=bchocse >&log_assembly.txt
Note 1: For this example to work, make sure that the read pairs are named using the Solexa standard, i.e., having '/1' as postfix to the name of one read and '/2' for the other read. If yours have a different naming scheme, look up the -LR:rns parameter in the main documentation.
Note 2: Please look up the parameters used in the main manual. The ones above basically say: make an accurate mapping of Solexa reads against a genome, in one pass, load additional strain data, the name of the backbone is 'bchoc_wt', the file type containing the backbone is a GenBank file, the base qualities for the backbone are to be assumed Q30. Additionally, only for Solexa reads: do not merge short reads into the contig, do not use template size information when placing reads, and set the minimum and maximum expected distance to 250 and 750 respectively (stored for later analysis).
Note 3: You will want to use other values than 250 and 750 if your Solexa paired-end library does not have insert sizes of approximately 500 bases.
Comparing this command line with the command line for unpaired data, two parameters were added in the section for Solexa data:
-CO:msr=no tells MIRA not to merge reads that are 100% identical to the backbone. This also allows MIRA to keep the template information for the reads.
-GE:uti=no switches off the checking of template sizes when inserting reads into the backbone. At first glance this might seem counter-intuitive, but it's absolutely necessary to spot, e.g., genome re-arrangements or indels in data analysis after the assembly.
The reason is that if template size checking were on, the following would happen at, e.g. sites of re-arrangement: MIRA would map the first read of a read-pair without problem. However, it would very probably reject the second read because it would not map at the specified distance from its partner. Therefore, in mapping assemblies with paired-end data, checking of the template size must be switched off.
-GE:tismin and -GE:tismax were set to give the minimum and maximum distance paired-end reads may be away from each other. Though this information is not used by MIRA in the assembly itself, it is stored in the result files and can be used afterwards by analysis programs which search for genome re-arrangements.
Note: for other settings you might want to change depending on the length of your Solexa reads, see the section above on mapping of unpaired data.
This section just gives a short overview of the tags you might find interesting. For more information, especially on how to configure gap4 or consed, please consult the mira usage document and the mira manual.
In file types that allow tags (CAF, MAF, ACE), SNPs and other interesting features will be marked by MIRA with a number of tags. The following sections give a brief overview. For a description of what the tags are (SROc, WRMc etc.), please read up the section "Tags used in the assembly by MIRA and EdIt" in the main manual.
Note: Screenshots in this section are taken from the walk-through with Lenski data (see below).
The SROc tag will point to most SNPs. Should you assemble sequences of more than one strain (I cannot really recommend such a strategy), you might also encounter SIOc and SAOc tags.
The WRMc tags might sometimes point to small indels of one or two bases.
Large deletions: the MCVc tags point to deletions in the resequenced data, where no read is covering the reference genome.
Figure 10. "MCVc" tag (dark red stretch in figure) showing a genome deletion in Solexa mapping assembly.
Insertions, small deletions and re-arrangements: these are harder to spot. In unpaired data sets they can be found by looking at clusters of SROc, SRMc, WRMc, and / or UNSc tags.
More massive occurrences of these tags lead to a rather colourful display in finishing programs, which is why these clusters are sometimes also called Xmas trees.
In sets with paired-end data, post-processing software (or alignment viewers) can use the read-pair information to guide you to these sites (MIRA doesn't set tags at the moment).
The UNSc tag points to areas where the consensus algorithm had trouble choosing a base. This happens in low coverage areas, at places of insertions (compared to the reference genome), or sometimes also in places where repeats with a few bases difference are present. Often enough, these tags are in areas with sequences problematic for the Solexa sequencing technology like, e.g., a GGCxG or even GGC motif in the reads.
The SRMc tag points to places where repeats with a few bases difference are present. Here too, sequences problematic for the Solexa technology are likely to have caused base calling errors and subsequently the setting of this tag.
Biologists are not really interested in SNP coordinates, and why should they be? They're more interested in where SNPs are, how good they are, which genes or other elements they hit, whether they have an effect on a protein sequence, whether they may be important, etc. For organisms without intron/exon structure or splice variants, MIRA can generate pretty comprehensive tables and files if an annotated GenBank file was used as reference and strain information was given to MIRA during the assembly.
Well, MIRA does all that automatically for you if the reference sequence you gave was annotated.
For this, convert_project should be used with the asnp format as target and a CAF file as input:
$
convert_project -f caf -t asnp input.caf output
Note that it is strongly suggested to perform a quick manual cleanup of the assembly prior to this: in rare cases (mainly at sites of small indels of one or two bases), mira will not tag SNPs with a SNP tag (SROc, SAOc or SIOc) but will instead be fooled into setting a tag denoting unsure positions (UNSc). This can be quickly corrected manually. See further down in this manual in the section on post-processing.
After conversion, you will have four files in the directory which you can all drag-and-drop into spreadsheet applications like OpenOffice Calc or Excel.
The files should be pretty self-explanatory, here's just a short overview:
output_info_snplist.txt is a simple list of the SNPs, with their positions relative to the reference sequence (in bases and map degrees on the genome) as well as the GenBank features they hit.
output_info_featureanalysis.txt is a much extended version of the list above. It puts the SNPs into the context of the features (proteins, genes, RNAs etc.) and gives a nice list, SNP by SNP, of what might cause bigger changes in proteins.
output_info_featuresummary.txt looks at the changes (SNPs, indels) the other way round. It gives an excellent overview of which features (genes, proteins, RNAs, intergenic regions) you should investigate.
There's one column (named 'interesting') which pretty much summarises everything you need into three categories: yes, no, and perhaps. 'Yes' is set if indels were detected, an amino acid changed, a start or stop codon changed, or for SNPs in intergenic regions and RNAs. 'Perhaps' is set for SNPs in proteins that change a codon but not an amino acid (silent SNPs). 'No' is set if no SNP hits a feature.
output_info_featuresequences.txt simply gives the sequence of each feature, for the reference sequence and for the resequenced strain.
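Since these files drop straight into spreadsheet applications, they are plain tab-separated text and can also be filtered on the command line. A sketch on a stand-in file (the layout and the position of the 'interesting' column are assumptions for the demonstration; check the header line of your real output_info_featuresummary.txt first):

```shell
# Stand-in for a featuresummary file, 'interesting' as second column
# (hypothetical layout -- verify against your real file's header):
printf 'feature\tinteresting\ngeneA\tyes\ngeneB\tno\ngeneC\tperhaps\n' > summary.txt
# Keep only features flagged 'yes' or 'perhaps', skipping the header:
awk -F'\t' 'NR > 1 && $2 != "no"' summary.txt
```

This leaves just the features worth a manual look.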
I've come to realise that people who don't handle data from NextGen sequencing technologies on a regular basis (e.g., many biologists) don't want to be bothered with learning to handle specialised programs to have a look at their resequenced strains. Be it because they don't have time to learn how to use a new program or because their desktop is not strong enough (CPU, memory) to handle the data sets.
Something even biologists know how to operate is a browser. Therefore, convert_project has the option to load a CAF file of a mapping assembly and output to HTML those areas which are interesting to biologists. It uses the tags SROc, SAOc, SIOc and MCVc and outputs the surrounding alignment of these areas together with a nice overview and links to jump from one position to the previous or next.
This is done with the '-t hsnp' option of convert_project:
$
convert_project -f caf -t hsnp input.caf output
Note: I recommend doing this only if the resequenced strain is a very close relative to the reference genome, else the HTML gets pretty big. But for a couple of hundred SNPs it works great.
convert_project can also dump a coverage file in WIG format (using '-t wig'). This comes in pretty handy for searching for genome deletions or duplications in programs like the Affymetrix Integrated Genome Browser (IGB, see http://igb.bioviz.org/).
We're going to use data published by Richard Lenski in his great paper "Genome evolution and adaptation in a long-term experiment with Escherichia coli". This shows how MIRA finds all mutations between two strains and how one would need just a few minutes to know which genes are affected.
Note: All steps described in this walkthrough are present in ready-to-be-run scripts in the solexa3_lenski demo directory of the MIRA package.
Note: This walkthrough takes a few detours which are not really necessary, but show how things can be done: it reduces the number of reads, it creates a strain data file etc. Actually, the whole demo could be reduced to two steps: downloading the data (naming it correctly) and starting the assembly with a couple of parameters.
We'll use the reference genome E.coli B REL606 to map one of the strains from the paper. For mapping, I picked strain REL8593A more or less at random. All the data needed is fortunately at the NCBI, let's go and grab it:
the NCBI has REL606 named NC_012967. We'll use the RefSeq version and the GenBank formatted file you can download from ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_B_REL606/NC_012967.gbk
the Solexa re-sequencing data you can get from ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX012/SRX012992/. Download both FASTQ files, SRR030257_1.fastq.gz and SRR030257_2.fastq.gz.
If you want more info regarding these data sets, have a look at http://www.ncbi.nlm.nih.gov/sra/?db=sra&term=SRX012992&report=full
In this section we will setup the directory structure for the assembly and pre-process the data so that MIRA can start right away.
Let's start with setting up a directory structure. Remember: you can setup the data almost any way you like, this is just how I do things.
I normally create a project directory with three sub-directories: origdata, data, and assemblies. In origdata I put the files exactly as I got them from the sequencing or data provider, without touching them, even removing write permissions on these files so that they cannot be tampered with. After that, I pre-process them and put the pre-processed files into data. Pre-processing can be a lot of things, starting from having to re-format the sequences, or renaming them, perhaps also doing clips etc. Finally, I use these pre-processed data in one or more assembly runs in the assemblies directory, perhaps trying out different assembly options.
arcadia:/some/path/$
mkdir lenskitest
arcadia:/some/path/$
cd lenskitest
arcadia:/some/path/lenskitest$
mkdir data origdata assemblies
arcadia:/some/path/lenskitest$
ls -l
drwxr-xr-x 2 bach bach 4096 2009-12-06 16:06 assemblies
drwxr-xr-x 2 bach bach 4096 2009-12-06 16:06 data
drwxr-xr-x 2 bach bach 4096 2009-12-06 16:06 origdata
Now copy the files you just downloaded into the directory
origdata
.
arcadia:/some/path/lenskitest$
cp /wherever/the/files/are/SRR030257_1.fastq.gz origdata
arcadia:/some/path/lenskitest$
cp /wherever/the/files/are/SRR030257_2.fastq.gz origdata
arcadia:/some/path/lenskitest$
cp /wherever/the/files/are/NC_012967.gbk origdata
arcadia:/some/path/lenskitest$
ls -l origdata
-rw-r--r-- 1 bach bach 10543139 2009-12-06 16:38 NC_012967.gbk
-rw-r--r-- 1 bach bach 158807975 2009-12-06 15:15 SRR030257_1.fastq.gz
-rw-r--r-- 1 bach bach 157595587 2009-12-06 15:21 SRR030257_2.fastq.gz
Great, let's preprocess the data. For this you must know a few things:
the standard Illumina naming scheme for Solexa paired-end reads is to append /1 to forward read names and /2 to reverse read names. The reads are normally put into at least two different files (one for forward, one for reverse). Now, the Solexa data stored in the Short Read Archive at the NCBI also has forward and reverse files for paired-end Solexas. That's OK. What's a bit less good is that the read names there DO NOT have /1 appended to the names of forward reads, or /2 to the names of reverse reads. The forward and reverse reads in both files are just named exactly the same. We'll need to fix that.
while Sanger and 454 reads should be preprocessed (clipping sequencing vectors, perhaps quality clipping etc.), Solexa reads need not be. Some people perform quality clipping or clip reads with too many 'N's in the sequence, but this is not needed when using MIRA. In fact, MIRA will itself perform everything needed for Solexa reads and will generally do a much better job, as the clipping performed is independent of the Solexa quality values (which are not always the most trustworthy).
for a mapping assembly, it's good to give the strain name of the backbone and the strain name for the reads mapped against it. The former can be done via the command line; the latter is done for each read individually in a key-value file (the straindata file).
So, to pre-process the data, we will need to
put the reads of the NCBI forward and reverse pairs into one file
append /1 to the names of forward reads, and /2 to the names of reverse reads
create a straindata file for MIRA
To ease things for you, I've prepared a small script which will do everything: copy and rename the reads as well as creating the strain names. Note that it's a small part of a more general script which I sometimes use to sample subsets of large data sets, but the Lenski data set is small enough that everything is taken.
Create a file prepdata.sh in the directory data and copy-paste the following into it:
######################################################################
#######
####### Prepare paired-end Solexa downloaded from NCBI
#######
######################################################################

# srrname: is the SRR name as downloaded from NCBI SRA
# numreads: maximum number of forward (and reverse) reads to take from
#           each file. Just to avoid bacterial projects with a coverage
#           of 200 or so.
# strainname: name of the strain which was re-sequenced

srrname="SRR030257"
numreads=5000000
strainname="REL8593A"

################################

numlines=$((4*${numreads}))

# put "/1" Solexa reads into file
echo "Copying ${numreads} reads from _1 (forward reads)"
zcat ../origdata/${srrname}_1.fastq.gz | head -${numlines} | sed -e 's/SRR[0-9.]*/&\/1/' >${strainname}-${numreads}_in.solexa.fastq

# put "/2" Solexa reads into file
echo "Copying ${numreads} reads from _2 (reverse reads)"
zcat ../origdata/${srrname}_2.fastq.gz | head -${numlines} | sed -e 's/SRR[0-9.]*/&\/2/' >>${strainname}-${numreads}_in.solexa.fastq

# make file with strainnames
echo "Creating file with strain names for copied reads (this may take a while)."
grep "@SRR" ${strainname}-${numreads}_in.solexa.fastq | cut -f 1 -d ' ' | sed -e 's/@//' -e "s/$/ ${strainname}/" >>${strainname}-${numreads}_straindata_in.txt
Now, let's create the needed data:
arcadia:/some/path/lenskitest$
cd data
arcadia:/some/path/lenskitest/data$
ls -l
-rw-r--r-- 1 bach bach 1349 2009-12-06 17:05 prepdata.sh
arcadia:/some/path/lenskitest/data$
sh prepdata.sh
Copying 5000000 reads from _1 (forward reads)
Copying 5000000 reads from _2 (reverse reads)
Creating file with strain names for copied reads (this may take a while).
arcadia:/some/path/lenskitest/data$
ls -l
-rw-r--r-- 1 bach bach 1349 2009-12-06 17:05 prepdata.sh
-rw-r--r-- 1 bach bach 1553532192 2009-12-06 15:36 REL8593A-5000000_in.solexa.fastq
-rw-r--r-- 1 bach bach 218188232 2009-12-06 15:36 REL8593A-5000000_straindata_in.txt
Last step, just for the sake of completeness, link in the GenBank formatted file of the reference strain, giving it the same base name so that everything is nicely set up for MIRA.
arcadia:/some/path/lenskitest/data$
ln -s ../origdata/NC_012967.gbk REL8593A-5000000_backbone_in.gbf
arcadia:/some/path/lenskitest/data$
ls -l
-rw-r--r-- 1 bach bach 1349 2009-12-06 17:05 prepdata.sh
lrwxrwxrwx 1 bach bach 25 2009-12-06 16:39 REL8593A-5000000_backbone_in.gbf -> ../origdata/NC_012967.gbk
-rw-r--r-- 1 bach bach 1553532192 2009-12-06 15:36 REL8593A-5000000_in.solexa.fastq
-rw-r--r-- 1 bach bach 218188232 2009-12-06 15:36 REL8593A-5000000_straindata_in.txt
arcadia:/some/path/lenskitest/data$
cd ..
arcadia:/some/path/lenskitest$
Perfect, we're ready to start assemblies.
arcadia:/some/path/lenskitest$
cd assemblies
arcadia:/some/path/lenskitest/assemblies$
mkdir 1sttest
arcadia:/some/path/lenskitest/assemblies$
cd 1sttest
arcadia:/some/path/lenskitest/assemblies/1sttest$
lndir ../../data
arcadia:/some/path/lenskitest/assemblies/1sttest$
ls -l
lrwxrwxrwx 1 bach bach 22 2009-12-06 17:18 prepdata.sh -> ../../data/prepdata.sh
lrwxrwxrwx 1 bach bach 43 2009-12-06 16:40 REL8593A-5000000_backbone_in.gbf -> ../../data/REL8593A-5000000_backbone_in.gbf
lrwxrwxrwx 1 bach bach 43 2009-12-06 15:39 REL8593A-5000000_in.solexa.fastq -> ../../data/REL8593A-5000000_in.solexa.fastq
lrwxrwxrwx 1 bach bach 45 2009-12-06 15:39 REL8593A-5000000_straindata_in.txt -> ../../data/REL8593A-5000000_straindata_in.txt
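In case lndir (part of the X11 utilities) is not installed on your system, a plain shell loop creates the same set of symlinks (a sketch using this walkthrough's paths):

```shell
# substitute for 'lndir ../../data': symlink every file from the
# data directory into the current assembly directory
for f in ../../data/*; do
    ln -s "$f" .
done
```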
Oops, we don't need the prepdata.sh link here, just delete it.
arcadia:/some/path/lenskitest/assemblies/1sttest$
rm prepdata.sh
Perfect. Now then, start a simple mapping assembly:
arcadia:/some/path/lenskitest/assemblies/1sttest$
mira --fastq --project=REL8593A-5000000 --job=mapping,genome,accurate,solexa -SB:lsd=yes:bsn=ECO_B_REL606:bft=gbf >&log_assembly.txt
Note 1: The above command must be entered on one line. It basically says: load all data in FASTQ format; the project name is REL8593A-5000000 (and therefore all input and output files will have this prefix by default if not chosen otherwise); we want an accurate mapping of Solexa reads against a genome; load strain data from a separate strain file ([-SB:lsd=yes]); the strain name of the reference sequence is 'ECO_B_REL606' ([-SB:bsn=ECO_B_REL606]); and the file containing the reference sequence is in GenBank format ([-SB:bft=gbf]). Last but not least, redirect the progress output of the assembler to a file named log_assembly.txt.
Note 2: The above assembly takes approximately 35 minutes on my computer (i7 940 with 12 GB RAM) when using 4 threads (I additionally have '-GE:not=4'). It may be faster or slower on your computer.
Note 3: You will need some 10.5 GB RAM to get through this. You might get away with a bit less RAM by using swap, but less than 8 GB RAM is not recommended.
Let's have a look at the directory now:
arcadia:/some/path/lenskitest/assemblies/1sttest$
ls -l
-rw-r--r-- 1 bach bach 1463331186 2010-01-27 20:41 log_assembly.txt
drwxr-xr-x 6 bach bach 4096 2010-01-27 20:04 REL8593A-5000000_assembly
lrwxrwxrwx 1 bach bach 43 2009-12-06 16:40 REL8593A-5000000_backbone_in.gbf -> ../../data/REL8593A-5000000_backbone_in.gbf
lrwxrwxrwx 1 bach bach 43 2009-12-06 15:39 REL8593A-5000000_in.solexa.fastq -> ../../data/REL8593A-5000000_in.solexa.fastq
lrwxrwxrwx 1 bach bach 45 2009-12-06 15:39 REL8593A-5000000_straindata_in.txt -> ../../data/REL8593A-5000000_straindata_in.txt
Not much has changed. All files created by MIRA will be in the REL8593A-5000000_assembly directory. Going one level down, you'll see 4 sub-directories:
arcadia:/some/path/lenskitest/assemblies/1sttest$
cd REL8593A-5000000_assembly
arcadia:.../1sttest/REL8593A-5000000_assembly$
ls -l
drwxr-xr-x 2 bach bach 4096 2010-01-27 20:29 REL8593A-5000000_d_chkpt
drwxr-xr-x 2 bach bach 4096 2010-01-27 20:40 REL8593A-5000000_d_info
drwxr-xr-x 2 bach bach 4096 2010-01-27 20:30 REL8593A-5000000_d_tmp
drwxr-xr-x 2 bach bach 4096 2010-01-27 21:19 REL8593A-5000000_d_results
You can safely delete the tmp and the chkpt directories, in this walkthrough they are not needed anymore.
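The cleanup is a one-liner, using the directory names from this walkthrough:

```shell
# remove checkpoint and temporary data; the results and info
# directories are kept untouched
rm -rf REL8593A-5000000_d_chkpt REL8593A-5000000_d_tmp
```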
Results will be in a sub-directory created by MIRA. Let's go there and have a look.
arcadia:/some/path/lenskitest/assemblies/1sttest$
cd REL8593A-5000000_assembly
arcadia:.../1sttest/REL8593A-5000000_assembly$
cd REL8593A-5000000_d_results
arcadia:.../REL8593A-5000000_d_results$
ls -l
-rw-r--r-- 1 bach bach 455087340 2010-01-27 20:40 REL8593A-5000000_out.ace
-rw-r--r-- 1 bach bach 972479972 2010-01-27 20:38 REL8593A-5000000_out.caf
-rw-r--r-- 1 bach bach 569619434 2010-01-27 20:38 REL8593A-5000000_out.maf
-rw-r--r-- 1 bach bach 4708371 2010-01-27 20:39 REL8593A-5000000_out.padded.fasta
-rw-r--r-- 1 bach bach 14125036 2010-01-27 20:39 REL8593A-5000000_out.padded.fasta.qual
-rw-r--r-- 1 bach bach 472618709 2010-01-27 20:39 REL8593A-5000000_out.tcs
-rw-r--r-- 1 bach bach 4707025 2010-01-27 20:39 REL8593A-5000000_out.unpadded.fasta
-rw-r--r-- 1 bach bach 14120999 2010-01-27 20:39 REL8593A-5000000_out.unpadded.fasta.qual
-rw-r--r-- 1 bach bach 13862715 2010-01-27 20:39 REL8593A-5000000_out.wig
You can see that MIRA has created output in many different formats suited for a number of different applications. Most commonly known will be ACE and CAF for their use in finishing programs (e.g. gap4 and consed).
In a different directory (the info directory) there are also files containing all sorts of statistics and useful information.
arcadia:.../REL8593A-5000000_d_results$
cd ../REL8593A-5000000_d_info/
arcadia:.../REL8593A-5000000_d_info$
ls -l
-rw-r--r-- 1 bach bach 2256 2010-01-27 20:40 REL8593A-5000000_info_assembly.txt
-rw-r--r-- 1 bach bach 124 2010-01-27 20:04 REL8593A-5000000_info_callparameters.txt
-rw-r--r-- 1 bach bach 37513 2010-01-27 20:37 REL8593A-5000000_info_consensustaglist.txt
-rw-r--r-- 1 bach bach 28522692 2010-01-27 20:37 REL8593A-5000000_info_contigreadlist.txt
-rw-r--r-- 1 bach bach 176 2010-01-27 20:37 REL8593A-5000000_info_contigstats.txt
-rw-r--r-- 1 bach bach 15359354 2010-01-27 20:40 REL8593A-5000000_info_debrislist.txt
-rw-r--r-- 1 bach bach 45802751 2010-01-27 20:37 REL8593A-5000000_info_readtaglist.txt
Just have a look at them to get a feeling for what they show. You'll find more information regarding these files in the main manual of MIRA. For the moment, let's just make a quick assessment of the differences between the Lenski reference strain and the REL8593A strain by counting how many SNPs MIRA thinks there are (marked with SROc tags in the consensus):
arcadia:.../REL8593A-5000000_d_info$
grep -c SROc REL8593A-5000000_info_consensustaglist.txt
102
102 bases are marked with such a tag. You will later see that this is an overestimation due to several insertion sites and deletions, but it's a good first approximation.
Let's count how many potential deletion sites REL8593A has in comparison to the reference strain:
arcadia:.../REL8593A-5000000_d_info$
grep -c MCVc REL8593A-5000000_info_consensustaglist.txt
48
This number too is a slight overestimation, due to cross-contamination with a sequenced strain which did not have these deletions, but it's also a good first approximation.
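Instead of grepping for each tag type separately, all tag types of interest can be tallied in one go (a sketch; it only assumes the tag names appear verbatim in the file, as the grep commands above do):

```shell
# count occurrences of the most interesting consensus tags at once
grep -oE 'SROc|SIOc|MCVc|UNSc|WRMc|IUPc' \
    REL8593A-5000000_info_consensustaglist.txt | sort | uniq -c
```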
To have a look at your project in gap4, use the caf2gap program (you can get it at the Sanger Centre), and then gap4:
arcadia:.../REL8593A-5000000_d_results$
ls -l
-rw-r--r-- 1 bach bach 455087340 2010-01-27 20:40 REL8593A-5000000_out.ace
-rw-r--r-- 1 bach bach 972479972 2010-01-27 20:38 REL8593A-5000000_out.caf
-rw-r--r-- 1 bach bach 569619434 2010-01-27 20:38 REL8593A-5000000_out.maf
-rw-r--r-- 1 bach bach 4708371 2010-01-27 20:39 REL8593A-5000000_out.padded.fasta
-rw-r--r-- 1 bach bach 14125036 2010-01-27 20:39 REL8593A-5000000_out.padded.fasta.qual
-rw-r--r-- 1 bach bach 472618709 2010-01-27 20:39 REL8593A-5000000_out.tcs
-rw-r--r-- 1 bach bach 4707025 2010-01-27 20:39 REL8593A-5000000_out.unpadded.fasta
-rw-r--r-- 1 bach bach 14120999 2010-01-27 20:39 REL8593A-5000000_out.unpadded.fasta.qual
-rw-r--r-- 1 bach bach 13862715 2010-01-27 20:39 REL8593A-5000000_out.wig
arcadia:.../REL8593A-5000000_d_results$
caf2gap -project REL8593A -ace REL8593A-5000000_out.caf >&/dev/null
arcadia:.../REL8593A-5000000_d_results$
ls -l
-rw-r--r-- 1 bach bach 1233494048 2010-01-27 20:43 REL8593A.0
-rw-r--r-- 1 bach bach 233589448 2010-01-27 20:43 REL8593A.0.aux
-rw-r--r-- 1 bach bach 455087340 2010-01-27 20:40 REL8593A-5000000_out.ace
-rw-r--r-- 1 bach bach 972479972 2010-01-27 20:38 REL8593A-5000000_out.caf
-rw-r--r-- 1 bach bach 569619434 2010-01-27 20:38 REL8593A-5000000_out.maf
-rw-r--r-- 1 bach bach 4708371 2010-01-27 20:39 REL8593A-5000000_out.padded.fasta
-rw-r--r-- 1 bach bach 14125036 2010-01-27 20:39 REL8593A-5000000_out.padded.fasta.qual
-rw-r--r-- 1 bach bach 472618709 2010-01-27 20:39 REL8593A-5000000_out.tcs
-rw-r--r-- 1 bach bach 4707025 2010-01-27 20:39 REL8593A-5000000_out.unpadded.fasta
-rw-r--r-- 1 bach bach 14120999 2010-01-27 20:39 REL8593A-5000000_out.unpadded.fasta.qual
-rw-r--r-- 1 bach bach 13862715 2010-01-27 20:39 REL8593A-5000000_out.wig
arcadia:.../REL8593A-5000000_d_results$
gap4 REL8593A.0
Search for the tags set by MIRA which denote features or problems (SROc, WRMc, MCVc, UNSc, IUPc; see the main manual for the full list) in the assembly, and edit accordingly. Save your gap4 database as a new version (e.g. REL8593A.1), then exit gap4.
Then use the gap2caf command (also from the Sanger Centre) to convert the gap4 database back to CAF.
arcadia:.../REL8593A-5000000_d_results$
gap2caf -project REL8593A.1 >rel8593a_edited.caf
As gap4 jumbles the consensus (it does not know about different sequencing technologies), having convert_project recalculate the consensus (with the "-r c" option) is generally a good idea.
arcadia:.../REL8593A-5000000_d_results$
convert_project -f caf -t caf -r c rel8593a_edited.caf rel8593a_edited_recalled
You will have to use either CAF or MAF as input, either of which can be the direct result from the MIRA assembly or an already cleaned and edited file. For the sake of simplicity, we'll use the file created by MIRA in the steps above.
Let's start with a HTML file showing all positions of interest:
arcadia:.../REL8593A-5000000_d_results$
convert_project -f caf -t hsnp REL8593A-5000000_out.caf rel8593a
arcadia:.../REL8593A-5000000_d_results$
ls -l *html
-rw-r--r-- 1 bach bach 5198791 2010-01-27 20:49 rel8593a_info_snpenvironment.html
But MIRA can do even better: create tables ready to be imported in spreadsheet programs.
arcadia:.../REL8593A-5000000_d_results$
convert_project -f caf -t asnp REL8593A-5000000_out.caf rel8593a
arcadia:.../REL8593A-5000000_d_results$
ls -l rel8593a*
-rw-r--r-- 1 bach bach 25864 2010-01-27 20:48 rel8593a_info_featureanalysis.txt -rw-r--r-- 1 bach bach 12402905 2010-01-27 20:48 rel8593a_info_featuresequences.txt -rw-r--r-- 1 bach bach 954473 2010-01-27 20:48 rel8593a_info_featuresummary.txt -rw-r--r-- 1 bach bach 5198791 2010-01-27 20:49 rel8593a_info_snpenvironment.html -rw-r--r-- 1 bach bach 13810 2010-01-27 20:47 rel8593a_info_snplist.txt
Have a look at all files, perhaps starting with the SNP list, then the feature analysis, then the feature summary (your biologists will love that one, especially when combined with filters in the spreadsheet program) and then the feature sequences.
This is actually quite straightforward if you name your reads according to the MIRA standard for input files. Assume you have the following files (bchocse being an example for your mnemonic for the project):
arcadia:/path/to/myProject$
ls -l
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocse_in.solexa.fastq
Here's the simplest way to start the assembly:
arcadia:/path/to/myProject$
mira --project=bchocse --job=denovo,genome,accurate,solexa >&log_assembly.txt
Of course, you can add any other switch you want like, e.g., changing the number of processors used, adding default strain names etc.pp
If you have only one library with one insert size, you just need to tell MIRA this minimum and maximum distance the reads should be away from each other. In the following example I have a library size of 500 bp and have set the minimum and maximum distance to +/- 50% (you might want to use other modifiers):
arcadia:/path/to/myProject$
mira --project=bchocse --job=denovo,genome,accurate,solexa SOLEXA_SETTINGS -GE:tismin=250:tismax=750 >&log_assembly.txt
Note: For this example to work, make sure that the read pairs are named using the Solexa standard, i.e., having /1 for one read and /2 for the other read. If yours have a different naming scheme, look up the -LR:rns parameter in the main documentation.
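The +/- 50% window from the example above is simple shell arithmetic, should you want to script it for different libraries (a sketch):

```shell
insertsize=500                    # library insert size in bases
tismin=$((insertsize / 2))        # -50% of the insert size
tismax=$((insertsize * 3 / 2))    # +50% of the insert size
echo "-GE:tismin=${tismin}:tismax=${tismax}"
# prints -GE:tismin=250:tismax=750
```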
To tell MIRA exactly which reads have which insert size, one must use an XML file containing ancillary data in NCBI TRACEINFO format. In case you don't have such a file, here's a very simple example containing only insert sizes for reads (lane 1 has a library size of 500 bases and lane 2 a library size of 2 Kb):
<?xml version="1.0"?>
<trace_volume>
  <trace>
    <trace_name>1_17_510_1281/1</trace_name>
    <insert_size>500</insert_size>
    <insert_stdev>100</insert_stdev>
  </trace>
  <trace>
    <trace_name>1_17_510_1281/2</trace_name>
    <insert_size>500</insert_size>
    <insert_stdev>100</insert_stdev>
  </trace>
  ...
  <trace>
    <trace_name>2_17_857_850/1</trace_name>
    <insert_size>2000</insert_size>
    <insert_stdev>300</insert_stdev>
  </trace>
  <trace>
    <trace_name>2_17_857_850/2</trace_name>
    <insert_size>2000</insert_size>
    <insert_stdev>300</insert_stdev>
  </trace>
  ...
</trace_volume>
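If you have to create such a file yourself, the skeleton can be generated from the FASTQ read names with a small awk sketch like the following. The file name (reads.fastq) as well as the insert size and standard deviation are assumptions for a single library; adapt them to your data:

```shell
# write a minimal TRACEINFO XML for every read in reads.fastq,
# assuming one single library of 500 bases +/- 100
{
  echo '<?xml version="1.0"?>'
  echo '<trace_volume>'
  # every 4th line of a FASTQ file is a read header; strip the '@'
  awk 'NR % 4 == 1 { sub(/^@/, "", $1);
         print "  <trace>";
         print "    <trace_name>" $1 "</trace_name>";
         print "    <insert_size>500</insert_size>";
         print "    <insert_stdev>100</insert_stdev>";
         print "  </trace>" }' reads.fastq
  echo '</trace_volume>'
} > traceinfo_in.solexa.xml
```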
So, if your directory looks like this:
arcadia:/path/to/myProject$
ls -l
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocse_in.solexa.fastq
-rw-r--r-- 1 bach users 324987513 2008-04-01 13:24 bchocse_traceinfo_in.solexa.xml
then starting the assembly is done like this (note the additional [-LR:mxti] parameter in the section for Solexa setting):
arcadia:/path/to/myProject$
mira --project=bchocse --job=denovo,genome,accurate,solexa SOLEXA_SETTINGS -LR:mxti=yes >&log_assembly.txt
Two strategies can be thought of to assemble genomes using a combination of Solexa and other (longer) reads: either using all reads in a full de-novo assembly, or first assembling the longer reads and then using the resulting assembly as a backbone against which to map the Solexa reads. Both strategies have their pros and cons.
Throwing all reads into a de-novo assembly is the most straightforward way to get 'good' assemblies. This strategy is also the one which - in most cases - yields the longest contigs, as in many projects parts of a genome not covered by one sequencing technology will probably be covered by another. Furthermore, having the consensus covered by more than one sequencing technology makes base calling pretty robust: if MIRA finds disagreements it cannot resolve easily, the assembler at least leaves a tag in the assembly to point human finishers to these positions of interest.
The downside of this approach, however, is that the sheer amount of data in Solexa sequencing projects makes life difficult for de-novo assemblers, especially for MIRA, which keeps quite a lot of additional information in memory during de-novo assemblies and tries to use algorithms that are as exact as possible during contig construction. Therefore, MIRA sometimes still runs into data sets which make it behave quite badly with respect to assembly time and memory consumption (but this is being constantly improved).
Full de-novo hybrid assemblies can be recommended only for bacteria at the moment, although lower eukaryotes should also be feasible on larger machines.
Starting the assembly is now just a matter of a simple command line with some parameters set correctly. The following is a de-novo hybrid assembly with 454 and Solexa reads.
arcadia:/path/to/myProject$
mira --project=bchocse --job=denovo,genome,normal,454,solexa >&log_assembly.txt
This strategy works in two steps: first assembling long reads, then mapping short reads to the full alignment (not just a consensus sequence). The result will be an assembly containing 454 (or Sanger) and Solexa reads.
Assemble your data just as you would when assembling 454 or Sanger data.
This step fetches 'long' contigs from the previous assembly. The idea is to keep all contigs larger than 500 bases.
$
convert_project -f caf -t caf -x 500 assemblyresult.caf hybrid_backbone_in.caf
You might also want to add an additional filter for minimum average coverage. If your project has an average coverage of 24, you could filter for a minimum average coverage of one third of that (coverage 8; you might want to try out higher values) like this:
$
convert_project -f caf -t caf -x 500 -y 8 assemblyresult.caf hybrid_backbone_in.caf
Copy the hybrid backbone to a new empty directory, add in the Solexa data, start a mapping assembly using the CAF as input for the backbone. If you assembled the 454 / Sanger data with strain info, the Solexa data should also get those (as described above).
arcadia:/path/to/myProject$
ls -l
-rw-r--r-- 1 bach bach 1159280980 2009-10-31 19:46 hybrid_backbone_in.caf
-rw-r--r-- 1 bach bach 338430282 2009-10-31 20:31 hybrid_in.solexa.fastq
arcadia:/path/to/myProject$
mira --project=hybrid --job=mapping,genome,accurate,solexa -AS:nop=1 -SB:bft=caf >&log_assembly.txt
This section is a bit terse; you should also read the chapter on working with results of MIRA3.
When working with resequencing data and a mapping assembly, I always load finished projects into an assembly editor and perform a quick cleanup of the results.
For close relatives of the reference strain this doesn't take long as MIRA will have set tags (see section earlier in this document) at all sites you should have a look at. For example, very close mutant bacteria with just SNPs or simple deletions and no genome reorganisation, I usually clean up in 10 to 15 minutes. That gives the last boost to data quality and your users (biologists etc.) will thank you for that as it reduces their work in analysing the data (be it looking at data or performing wet-lab experiments).
Assume you have the following result files in the result directory of a MIRA assembly:
arcadia:/path/to/myProject/newstrain_d_results$
ls -l
-rw-r--r-- 1 bach bach 312607561 2009-06-08 14:57 newstrain_out.ace -rw-r--r-- 1 bach bach 655176303 2009-06-08 14:56 newstrain_out.caf ...
The general workflow I use is to convert the CAF file to a gap4 database and start the gap4 editor:
arcadia:newstrain_d_results$
caf2gap -project NEWSTRAIN -ace newstrain_out.caf >& /dev/null
arcadia:newstrain_d_results$
gap4 NEWSTRAIN.0
Then, in gap4, I
quickly search for the UNSc and WRMc tags and check whether they could be real SNPs that were missed by MIRA. In that case, I manually set an SROc (or SIOc) tag in gap4 via hotkeys that were defined to set these tags.
sometimes also quickly clean up reads that are causing trouble in alignments and lead to wrong base calling. These can be found at sites with UNSc tags; most of the time they contain the 5' to 3' GGCxG motif, which can cause trouble for Solexa.
look at sites with deletions (tagged with MCVc) and check whether I should clean up the borders of the deletion.
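To get a rough feeling for how often the GGCxG motif occurs, a quick count on the unpadded consensus works (a sketch; it is line-based, so motifs spanning FASTA line breaks are missed, and the reverse complement strand would need a second pass):

```shell
# approximate count of the GGCxG motif on the forward strand;
# skip FASTA header lines first
grep -v '^>' newstrain_out.unpadded.fasta | grep -oE 'GGC.G' | wc -l
```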
After this, I convert the gap4 database back to CAF format:
$
gap2caf -project NEWSTRAIN >newstrain_edited.caf
But beware: gap4 does not have the same consensus calling routines as MIRA and will have saved its own consensus in the new CAF. In fact, gap4 performs rather badly in projects with multiple sequencing technologies. So I use convert_project from the MIRA package to recall a good consensus (and save it in MAF, as it's more compact and a lot faster to handle than CAF):
$
convert_project -f caf -t maf -r c newstrain_edited.caf newstrain_edited_recalled
And from this file I can then convert with convert_project to any other format I or my users need: CAF, FASTA, ACE, WIG (for coverage analysis) etc.pp.
I can also generate tables and HTML files with SNP analysis results (with the "-t asnp" and "-t hsnp" options of convert_project).
As the result file of MIRA de-novo assemblies contains everything down to 'contigs' with just two reads, it is advised to first filter out all contigs which are smaller than a given size or have a coverage lower than 1/3 to 1/2 of the overall coverage.
Filtering is performed by convert_project using CAF file as input. Assume you have the following file:
arcadia:/path/to/myProject/newstrain_d_results$
ls -l
... -rw-r--r-- 1 bach bach 655176303 2009-06-08 14:56 newstrain_out.caf ...
Let's say you have a hybrid assembly with an average coverage of 50x. I normally filter out all contigs which have an average coverage less than 1/3 and are smaller than 500 bases. These are mostly junk contiglets remaining from the assembly and can be more or less safely ignored. This is done the following way:
arcadia:newstrain_d_results$
convert_project -f caf -t caf -x 500 -y 17 newstrain_out.caf newstrain_filterx500y17
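The -y value is just the rounded third of the average coverage; should you want to compute it in a script, a quick awk one-liner does it (the coverage value is the one from this example):

```shell
avgcov=50    # average coverage of the assembly
# one third of the average coverage, rounded to the nearest integer
y=$(awk -v c="$avgcov" 'BEGIN{ printf "%.0f", c/3 }')
echo "$y"    # prints 17
```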
From there on, convert the filtered CAF file to anything you need to continue finishing of the genome (gap4 database, ACE, etc.pp).
These apply to version 3 of MIRA and might or might not have been addressed in later versions.
Bugs:
mapping of paired-end reads with one read being in non-repetitive area and the other in a repeat is not as effective as it should be. The optimal strategy to use would be to map first the non-repetitive read and then the read in the repeat. Unfortunately, this is not yet implemented in MIRA.
Problems:
the textual output of results is really slow with such massive amounts of data as with Solexa projects. If Solexa data is present, it's turned off by default at the moment.