“New problems demand new solutions. New solutions create new problems.”
-- Solomon Short
This is here to set the stage: at the moment, MIRA can only make use of PacBio reads which have an error rate of around 5%. This means you will have to use either:
CCS (Circular Consensus Sequence) reads with at least 3 to 4 passes.
CLR (Continuous Long Reads) which were error corrected either with PacBio CCS reads or some other high-quality sequencing technology (Illumina comes to mind)
The impatient can jump directly to Section 6, “Walkthroughs: real data sets from PacBio”, which contains walkthroughs using data made publicly available by PacBio.
When I developed the routines for PacBio, I had no access to their data and it shows. Now that the first numbers regarding their reads are being published and the first data sets become publicly available, I realise a couple of preconditions were not met. In particular, the fact that raw PacBio CLR reads seem to have an error rate between 1-in-5 (80% correct) and 1-in-7 (85% correct) means that MIRA will probably not operate very well on them. One should have an error rate between 1-in-12 and 1-in-20 (92% to 95% correct) to get MIRA working happily.
During the course of 2011, PacBio made available on their DevNet site quite a number of documents and introductory videos. They are a must-read for everyone interested in this sequencing technology.
> **Note**
>
> As of MIRA 3.4.0, large parts of this documentation are still from a time when things like terminology, sequencing specifics etc. were not publicly available. It will take some time for me to convert the guide.
Pacific Biosciences looks like the new kid on the block of sequencing technologies. They seem to have, for the first time since Sanger sequencing, something which is able to produce reads that are actually longer than Sanger reads. They also have something new: strobed sequencing. That technique alone was reason enough for me to see whether it could be of any use. After a couple of modifications to the MIRA assembly engine, I think I can say that "yes, it very well can be."
One could feed strobed PacBio sequences to MIRA 3.0.0 and the 2.9.x line before it and get some results by faking them to be Sanger reads, though the results were not always pretty.
The first version of MIRA to officially support sequences from Pacific Biosciences is MIRA 3.2. Versions in the 3.0.1 to 3.0.5 range and 3.1.x had different degrees of support, but were never advertised as having it.
I am not affiliated with Pacific Biosciences nor do I -- unfortunately -- have early access to their data. Due to extreme secrecy, almost no one outside the company has actually seen their sequencing data. So some of what this guide contains is a bit of guesswork, reading through dozens and dozens of conference reports, blogs, press releases, tweets and whatever not.
But maybe I got some things right.
This guide assumes that you have basic working knowledge of Unix systems, know the basic principles of sequencing (and sequence assembly) and what assemblers do.
While there are step-by-step walkthroughs on how to set up your data for Sanger, 454 and Solexa in other MIRA guides, this guide is (currently) a bit more terse. You are expected to read, at some point in time:

- the mira_reference help file to look up some command line options.
- for hybrid assemblies of PacBio data with Sanger, 454 or Solexa, the corresponding mira_usage, mira_454 or mira_solexa help files to look up how to prepare the different data sets.
Let's first have a look at what sequencing (either paired or unpaired) meant until now. I won't go into the details of conventional sequencing as this is covered elsewhere in the MIRA manuals (and in the Web).
In conventional, unpaired sequencing, you have a piece of DNA (a DNA template) which a machine reads out and then gives you the sequence back. Assume your piece of DNA to be 10 kilobases long, but your machine can read only 1000 bases. Then what you get back (`DNA` below is the DNA template, `R1` is a read) is this:
```
DNA: actgttg...gtgcatgctgatgactgact.........gactgtgacgtactgcttga...actggatctg
R1 : actgttg...gtgcatgct
     \_________________/
              |
         ~1000 bases
```
In conventional paired-end sequencing, you still can read only 1000 bases, but you can do it at the beginning and at the end of a DNA template. This looks like that:
```
DNA: actgttg...gtgcatgctgatgactgact.........gactgtgacgtactgcttga...actggatctg
R1 : actgttg...gtgcatgct
R2 :                                                      gcttga...actggatctg
     \_________________/                                  \_________________/
              |                                                    |
         ~1000 bases                                          ~1000 bases
```
While you still have just two reads of approximately 1000 bases, you know one additional thing: these two reads are approximately 10000 bases apart. This additional information is very useful in assembly as it helps to resolve problematic areas.
Enter Pacific Biosciences with their strobed sequencing. With this approach, you can also sequence a given number of bases (they claim between 1000 and 3000), but you can sort of "distribute" the bases you want to read across the DNA template.
> **Warning**
>
> Overly simplified and probably totally inaccurate description ahead! Furthermore, the extremely short read and gap lengths in these examples serve only for demonstration purposes.
Here's a simple example: assume you could read around 40 bases with your machinery, but that the DNA template is some ~80 bases. And assume you could tell your machine to read between 6 and 8 bases at a time, then leave out the next 6 to 8 bases, then read again etc. Like so:
```
DNA: actgttggtgcatgctgatgactgactgactgtgacgtacttgactgactggatctgtgactgactgtgactgactg
R1a: actgttg
R1b:                 gatgactgac
R1c:                                    cgtacttga
R1d:                                                     atctgtgac
R1e:                                                                     gactgactg
```
While in the example above we still read only 44 bases, these 44 bases span 77 bases on the DNA template. Furthermore, we have the additional information that the sequence of reads is R1a, R1b, R1c, R1d and R1e and, because we asked the machine to read in such a pattern, we expect the gaps between the reads to be between 6 and 8 bases wide.
This is actually possible with the PacBio system. It streams the DNA template through a detection system which reads out the bases if, and only if, a light source (a laser) is switched on. Therefore, while streaming the template through the system, you read the DNA while the laser is on and you read nothing while it's off ... meanwhile, the template keeps streaming through.
Now, why would one want to turn the laser off?
It seems the light source is actually also the major limiting factor, as it has the nasty side-effect of degrading the very DNA it is supposed to read. A real bummer: after 1000 to 3000 bases (sometimes more, sometimes less), the DNA being read is probably so degraded and error-ridden (eventually even physically broken) that it makes no sense to continue reading.
Here comes the trick: instead of reading, say, 1000 bases in a row, you can read them in strobes: you switch the light on and start reading a couple of bases (say: 100), switch the light off, wait a bit until some bases (again, let's say approximately 100) have passed by, switch the light back on and read again ~100 bases, then switch off ... etc.pp until you have read your 1000 bases or, more likely, for as long as you can. But, as shown in the example above, these 1000 bases will be distributed across a much larger span of the original DNA template: in a pattern of ~100 bases read and ~100 bases not read, the smaller read fragments span ~1800 to ~2000 bases.
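The span arithmetic above is easy to verify with a quick sketch (a toy calculation, not anything MIRA does internally):

```python
def strobe_span(read_len, gap_len, total_read_bases):
    """Template span covered by strobes of read_len bases each,
    separated by dark inserts of gap_len bases."""
    n_strobes = total_read_bases // read_len
    # n strobes are separated by n-1 dark inserts
    return n_strobes * read_len + (n_strobes - 1) * gap_len

# 1000 sequenced bases in a ~100 read / ~100 gap pattern:
print(strobe_span(100, 100, 1000))  # 1900, i.e. within the ~1800-2000 range
```

With 10 strobes of 100 bases and 9 dark inserts of 100 bases in between, the 1000 sequenced bases indeed stretch over 1900 bases of template.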
Cool ... this is actually something an assembler can make real use of.
A more conventional approach could be: you switch the light on and start reading a couple of bases (say: 500), switch the light off, wait a bit until some bases (again, let's say approximately 10000) have passed by, switch the light back on and read again ~500 bases. This would be equivalent to a "normal" paired-end read with an insert size of 11Kb. But assemblers also can make good use of that.
Although Pacific Biosciences keeps pretty quiet on this topic, missed bases seem to be quite a problematic point. A bit like the 454 homopolymer problem but without homopolymers. From http://scienceblogs.com/geneticfuture/2010/02/pacific_biosciences_session_at.php
“Turner [the presenter from PacBio] said nothing concrete about error rates during his presentation, but this issue dominated the questions from the audience. Turner skilfully equivocated, steering clear of providing any hard numbers on the raw error rates and focusing on the system's ability to generate accurate consensus sequences through circular reads. Still, it's clear that deletion errors due to missing bases will pose a non-trivial problem for the system: Turner referred to algorithms for assembling sequence dominated by insertion/deletion errors currently in development.”
Someone else made a nice comment on this (from http://omicsomics.blogspot.com/2010/02/pacbios-big-splash.html):
“Well, not much on error rates from PacBio (apparently in the Q&A their presenter executed a jig, tango, waltz & rumba when asked).”
Astute readers will have noted that in the section on sequencing a DNA template with PacBio, I wrote “approximately” when defining the length of the stretch of non-read bases in strobed sequencing. According to conference reports, the length can only be estimated with a variance of 10-20%. From http://www.genomeweb.com/sequencing/pacbio-says-strobe-sequencing-increases-effective-read-length-single-molecule-se:
“There is uncertainty regarding the size of the "dark" inserts, owing to "subtle fluctuations" in the DNA synthesis speed, he said, but it becomes smaller with longer inserts. For example, with 400-base inserts, the coefficient of variation of its size is 20 percent, but it decreases to 10 percent with 1,600 bases.”
The reports are a bit contradictory regarding achievable read lengths. While PacBio mentions they have attained read lengths of up to 20Kb in their labs and expect to be able to go up to 50Kb, the first-generation machines are marketed with much lower expectations. However, "much lower" in this context still means: at least 1 Kb, with very good chances of having a good percentage of reads in the 3 to 5 Kb range.
The strobed sequencing method should allow one to do a couple of interesting things. First off, simulating conventional paired-end sequencing. Then, going into real strobed sequencing, extending the length reads span over a DNA template by perhaps doubling or tripling it will be extremely useful to cross all but the most annoying repeats one would encounter in prokaryotes ... and probably also eukaryotes, once PacBio regularly achieves lengths of 10000 bases.
MIRA currently knows two ways to handle strobed reads:
- a more traditional approach, using two strobes at a time as a read pair
- the "elastic dark insert" approach, where all strobes are put into one read and connected by stretches of `N` representing the dark inserts. "Elastic" means that -- the initial lengths of the dark inserts being rough estimates -- the insert lengths are then corrected in the iterative assembly passes of MIRA.
The elastic dark insert approach has an invaluable advantage: it keeps the small strobes connected in order in a read. This considerably reduces sources of errors when highly repetitive data is to be assembled where paired-end approaches also have limits.
Keeping the dark inserts as an integral part of the reads, however, also poses a couple of problems. Traditional Smith-Waterman alignments are likely to give some of these alignments a bad score, as there will invariably be a number of overlaps where the estimated length of the dark stretch is so different from the real length that an alignment algorithm needs to counterbalance this with a considerable number of inserted gaps. Which in turn can lower the Smith-Waterman score to a level where the needed identity thresholds are not met. The following example shows an excerpt of a case where a read whose dark insert length was estimated too low aligns against a read without a dark insert:
```
...TGACTGA****************nnnnnnnnnnnnnnnnnnnnnTCAGTTGAT...
...TGACTGATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
```
While MIRA has methods to counterbalance this kind of scoring malus (e.g., by simply not counting gap scores in dark strobe inserts), another effect then appears: multiple alignment disorders. As in many other assemblers, the construction of multiple alignments is done iteratively by aggregating new reads into an existing contig, aligning each against a temporary consensus. As the misestimation of dark insert lengths can reach comparatively high numbers, like 20 to 100 bases or more, problems can arise if several misestimated dark inserts come together at one place. A simple example: assume the following scenario, where reads 1, 2, 3 and 4 form a contig by being assembled in exactly this order (1, 2, 3, 4):
```
Read1 ...TGACTGAnnnnnnnnnnnnnnnnnnnnnTCAGTTGAT...
```
then
```
Read1 ...TGACTGA****************nnnnnnnnnnnnnnnnnnnnnTCAGTTGAT...
Read2 ...TGACTGATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
```
then
```
Read1 ...TGACTGA****************nnnnnnnnnnnnnnnnnnnnnTCAGTTGAT...
Read2 ...TGACTGATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
Read3 ...TGACTGATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
```
then
```
Read1 ...TGACT*****GA****************nnnnnnnnnnnnnnnnnnnnnTCAGTTGAT...
Read2 ...TGACT*****GATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
Read3 ...TGACT*****GATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
Read4 ...TGACTnnnnnnnnnnnnnnnnnnnnnnnnnnnnnATGCGTCGAGCTTGGTCAGTTGAT...
```
This can lead to severe misalignments in multiple alignments with several reads, as the following screenshot exemplifies.
Figure 1. Multiple alignment with PacBio elastic dark inserts, initial status with severe misalignments
However, MIRA is an iterative assembler working in multiple passes, with iterations within each pass. This allows for a strategy of iteratively correcting the estimated dark insert lengths. As with every sequencing technology it knows, MIRA analyses the multiple alignment of a contig in several ways and searches for, e.g., misassembled repeats (for more information on this, please refer to the MIRA manual). When handling reads with the technology from Pacific Biosciences, MIRA additionally analyses whether the length of each elastic dark insert, as measured in the multiple alignment, fits the estimated length. If not, the length of the dark insert is corrected up or down for the next pass, the correction factor being two thirds of the difference between the true (measured) and estimated length of the dark insert.
Coming back to the example used previously:
```
Read1 ...TGACT*****GA****************nnnnnnnnnnnnnnnnnnnnnTCAGTTGAT...
Read2 ...TGACT*****GATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
Read3 ...TGACT*****GATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
Read4 ...TGACTnnnnnnnnnnnnnnnnnnnnnnnnnnnnnATGCGTCGAGCTTGGTCAGTTGAT...
```
You will note that there are basically two elastic dark insert stretches: the first, in read 1, underestimates the dark insert size by 16 bases; the second, in read 4, overestimates it by five bases. Accordingly, MIRA will add two thirds of 16 = 10 `N`s to the estimated dark insert in read 1 and remove 3 `N`s (two thirds of 5) from read 4:
```
Read1 old ...TGACTGAnnnnnnnnnnnnnnnnnnnnnTCAGTTGAT...
Read1 new ...TGACTGANNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnTCAGTTGAT...
Read4 old ...TGACTnnnnnnnnnnnnnnnnnnnnnnnnnnnnnATGCGTCGAGCTTGGTCAGTTGAT...
Read4 new ...TGACTnnnnnnnnnnnnnnnnnnnnnnnnnnATGCGTCGAGCTTGGTCAGTTGAT...
```
These new reads will be used in the next (sub-)passes of MIRA. Continuing the example from above, the next multiple alignment of all four reads would look like this:
```
Read1 ...TGACT**GA******NNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnTCAGTTGAT...
Read2 ...TGACT**GATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
Read3 ...TGACT**GATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
Read4 ...TGACTnnnnnnnnnnnnnnnnnnnnnnnnnnATGCGTCGAGCTTGGTCAGTTGAT...
```
Again, the dark inserts would be corrected by MIRA, this time adding 4 `N`s to read 1 and removing one `N` from read 4, so that the next multiple alignment is this:
```
Read1 ...TGACT*GA**NNNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnTCAGTTGAT...
Read2 ...TGACT*GATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
Read3 ...TGACT*GATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
Read4 ...TGACTnnnnnnnnnnnnnnnnnnnnnnnnnATGCGTCGAGCTTGGTCAGTTGAT...
```
From there it is easy to see that just two more iterations are needed to replace the initial estimated length of the dark insert with its true length. The next screenshot continues the live example shown previously after the second pass of MIRA (remember that each pass can have multiple sub-passes):
One pass (and multiple sub-passes) later, the elastic dark inserts in this example have reached their true lengths. The multiple alignment is as good as it can get as the following figure shows:
The elastic dark insert strategy is quite successful at resolving most problems, but it sometimes fails to find a perfect solution. However, the remaining multiple alignment is -- in most cases -- good enough for a consensus algorithm to find the correct consensus, as the next screenshot shows:
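The convergence of the two-thirds correction scheme is easy to check with a toy simulation. Note that this is not MIRA's actual code: in particular, rounding the correction down and applying a minimum correction of one base are assumptions made here so that the last base of misestimation also gets fixed:

```python
def correct_dark_insert(estimated_len, true_len):
    """Toy model of iterative dark-insert length correction:
    each pass moves the estimate by two thirds of the remaining
    difference (assumed: rounded down, but at least 1 base)."""
    passes = 0
    while estimated_len != true_len:
        diff = true_len - estimated_len
        step = max((2 * abs(diff)) // 3, 1)  # assumed minimum step of 1
        estimated_len += step if diff > 0 else -step
        passes += 1
    return passes

# Read 1 from the worked example: 21 n's estimated, true dark insert
# 16 bases longer -> corrections of 10, 4, 1 and 1 bases = 4 passes,
# matching the "add 10, add 4, then two more iterations" description.
print(correct_dark_insert(21, 37))  # 4
```

Under these assumptions, even a badly misestimated insert converges in a handful of passes, since the remaining error shrinks to roughly a third each time.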
MIRA will happily read data in several different formats (FASTA, FASTQ, etc.). For the sake of simplicity, this guide will use FASTQ as demonstration format but will, most of the time, omit the quality lines.
This is actually quite simple. Just put your reads as FASTQ in a file and you are done. No need to bother about read naming conventions or similar things. Like so:
```
@readname_001
ACGTTGCAGGGTCATGCAGT...
@readname_002
...
```
You have two possibilities here. If the "dark insert" is not too long -- about the same size as or shorter than the average sequenced length at the ends -- you can put the data into one read and fill up the estimated dark insert length with the `n` character. The following example shows this, using lengths of 10 for the sequenced parts and a dark insert size of 10:
```
@readname_001
ACGTTGCAGGnnnnnnnnnnGTCATGCAGT
@readname_002
...
```
In case you have long "dark inserts", it is preferable to keep both parts physically separated in different reads. In this case, read naming becomes important for an assembler. Pick any paired-end naming scheme you want and name your reads accordingly. The following example shows the same data as above, split into two reads and using the Solexa naming scheme, which denotes the first read of a pair by appending `/1` to the read name and the second by appending `/2`:
```
@readname_001/1
ACGTTGCAGG
@readname_001/2
GTCATGCAGT
@readname_002/1
...
```
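Both layouts are mechanical to generate from the strobes and the estimated dark insert lengths. A minimal sketch (read names are arbitrary; real FASTQ records would also need `+` and quality lines, which this guide omits):

```python
def strobes_to_one_read(name, strobes, gap_estimates):
    """Join strobes into one record, filling each estimated
    dark insert with that many 'n' characters."""
    parts = [strobes[0]]
    for gap, strobe in zip(gap_estimates, strobes[1:]):
        parts.append('n' * gap)
        parts.append(strobe)
    return '@{}\n{}'.format(name, ''.join(parts))

def strobes_to_pair(name, strobe1, strobe2):
    """Write two strobes as a Solexa-style /1 and /2 read pair."""
    return '@{0}/1\n{1}\n@{0}/2\n{2}'.format(name, strobe1, strobe2)

print(strobes_to_one_read('readname_001', ['ACGTTGCAGG', 'GTCATGCAGT'], [10]))
print(strobes_to_pair('readname_001', 'ACGTTGCAGG', 'GTCATGCAGT'))
```

The first call reproduces the n-filled single-read layout, the second the split paired layout shown above.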
> **Note**
>
> The example above used the Solexa naming scheme to denote paired-end partner reads. You can use any naming scheme you want as long as MIRA knows it, e.g., the forward/reverse, Sanger or TIGR naming schemes; you will just need to tell MIRA about it with the [-LR:rns] parameter. As soon as the first data sets with PacBio are available, MIRA will also implement their naming scheme.
Should all your reads have approximately the same total length of first part (/1) + dark insert + second part (/2), then you don't need to create an additional file with information about the expected distance between the parts; you can use [-GE:tismin:tismax] to tell MIRA about it. In case you have different sizes -- because, e.g., you sequenced different libraries -- you will need to tell MIRA which reads have which distance from each other. You can do this with an XML file in TRACEINFO format (as defined by the NCBI). There will be other means in the future, but these have not been implemented yet.
Like in the case with two strobes, you have the choice between putting all strobes in one read ... or to separate the strobes in multiple reads. The following example shows the case where all strobes are in one read:
```
@readname_001
ACGTTGCAGGnnnnnnnnnnGTCATGCAGTnnnnnnnnnnnnnnnnnnnnnnnnTATGCACTGACnnnnnTAGCTGA
@readname_002
...
```
Note that the "dark inserts" do not necessarily need to be of the same length, not even within a read. Indeed, depending on your sequencing strategy they can have widely varying lengths, although one should take care that these inserts are not much longer than the longest strobes (or longest unstrobed read) in your data set.
In case you have long dark inserts, you should split the parts separated by these long inserts into different reads. E.g., if your strobes are 500 bases long, but separated by dark inserts > 1Kb, split them. You are free to split them however you like, in sub-pairs, in single strobes or whatever. For the example given above, this could be done like this:
```
@readname_001/1
ACGTTGCAGGnnnnnnnnnnGTCATGCAGT
@readname_001/2
TATGCACTGACnnnnnTAGCTGA
@readname_002
...
```
Note the dark inserts remaining in each read of the "virtual" read-pair. The same sequences could also be split like this:
```
@readname_001a/1
ACGTTGCAGG
@readname_001a/2
GTCATGCAGT
@readname_001b/1
TATGCACTGAC
@readname_001b/2
TAGCTGA
@readname_002
...
```
which would then be two read-pairs: the first and second strobes are paired, as well as the third and fourth. Here too, you can use any combination of strobes to pair with each other (or use them without pair information).
Combining the first and third strobe, as well as the second and fourth, would look like this:
```
@readname_001a/1
ACGTTGCAGG
@readname_001b/1
TAGCTGA
@readname_001b/2
GTCATGCAGT
@readname_001a/2
TATGCACTGAC
@readname_002
...
```
Note that in this case you will probably need to provide paired-end information in an NCBI TRACEARCHIVE XML file to tell MIRA about the different insert sizes.
Finally, you can put the reads all in one template like this:
```
@readname_001.f1
ACGTTGCAGG
@readname_001.f2
GTCATGCAGT
@readname_001.f3
TATGCACTGAC
@readname_001.f4
TAGCTGA
@readname_002
...
```
Note the subtle change in read naming: I switched to a different postfix scheme. This is because the Solexa naming scheme currently does not (officially) allow more than two reads per DNA template (just /1 and /2). The forward/reverse naming scheme as implemented by MIRA, however, does allow this.
This has just one drawback: currently MIRA will not be able to store the distances between the strobes when they are all in one template. This is being worked on and will be possible in a future version.
Create a directory where you copy your input data into (or where you set a soft-link where it really resides).
Currently (as of version 3.2.0) MIRA allows one input file per sequencing technology (one for Sanger, one for 454, one for Solexa and one for PacBio). This will change in the future, but for the moment it is how it is.
While you could name your input files whatever you like and pass them as parameters to MIRA, it is easier to follow a simple naming scheme that allows MIRA to find everything automatically. This scheme is `projectname_in.sequencingtechtype.filetypepostfix`. The `projectname` is a free string which you decide to give to your project. The `sequencingtechtype` can be one of "sanger", "454", "solexa" or "pacbio". Finally, the `filetypepostfix` is either "fasta" and "fasta.qual", "fastq", or any other type supported by MIRA.
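Applying the scheme is purely mechanical; a trivial sketch (the project name `myproject` is just an example):

```python
def mira_input_name(projectname, techtype, postfix):
    """Build a MIRA input file name following the
    projectname_in.sequencingtechtype.filetypepostfix scheme."""
    return '{}_in.{}.{}'.format(projectname, techtype, postfix)

print(mira_input_name('myproject', 'pacbio', 'fastq'))  # myproject_in.pacbio.fastq
```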
Note that MIRA supports loading a lot of other information files (XML TRACEINFO, strain data etc.), please consult the reference manual for more information.
In the most basic incantation, you will need to tell MIRA just five things:
- the name of your project
- whether you want a "genome" or "EST" assembly
- whether it is a denovo or mapping assembly
- which quality level to use (draft, normal or accurate)
- which sequencing technologies are involved (sanger, 454, solexa, pacbio)
Using the most basic quick switches of MIRA, the command line for an accurate denovo genome with PacBio data then looks like this:
```
mira --project=yourname --job=genome,denovo,accurate,pacbio
```
or for a hybrid PacBio and Solexa of the above:
```
mira --project=yourname --job=genome,denovo,accurate,pacbio,solexa
```
> **Note**
>
> MIRA has -- at the last count -- more than 150 parameters one can use to fine-tune almost every aspect of an assembly, from data loading options to results saving, from data preprocessing to results interpretation, from simple alignment parameters to parametrisation of internal misassembly decision rules ... and much more. Many of these parameters can even be set individually for each sequencing technology they apply to. For example: in an assembly with Solexa, Sanger, 454 and PacBio data, the minimum read length for Solexa could be set to 30, while for 454 it could be 80, Sanger 100 and PacBio 150. Please refer to the reference manual for a full overview on how to use quick switches and extended switches.
We'll use some data provided by PacBio for the E. coli O104:H4 outbreak in 2011, see http://www.pacbiodevnet.com/Share/Datasets/E-coli-Outbreak for more info.
That data set is quite interesting: PacBio took CLR reads (the reads with only ~85% accuracy) and mapped CCS reads (presumably >99% accuracy) to them to correct errors in the CLR reads. The resulting error-corrected CLR data is of pretty good quality: not only do the quality values say so, but when assembled, the number of sequencing errors which can be spotted in the alignments is obviously quite low.
Note: this is how I set up a project, feel free to implement whatever structure suits your needs.
```
$ mkdir c227-11-clrc
$ cd c227-11-clrc
arcadia:c227-11-clrc$ mkdir origdata data assemblies
```
Your directory should now look like this:
```
arcadia:c227-11-clrc$ ls -l
drwxr-xr-x 2 bach users 48 2011-08-19 20:21 assemblies
drwxr-xr-x 2 bach users 48 2011-08-19 20:21 data
drwxr-xr-x 2 bach users 48 2011-08-19 20:21 origdata
```
"c227-11-clrc" is an arbitrary name I chose by concatenating the name of the bug and "-clrc" to indicate that this project has CLR sequences which were Corrected. But you can name this whatever you want: foobar, blafurbsel, ...
Explanation of the structure:
- the `origdata` directory will contain the 'raw' result files that one might get from sequencing. In our case it will be the `.tar.gz` file with the data in FASTQ format from the DevNet site.
- the `data` directory will contain the preprocessed sequences for the assembly, ready to be used by MIRA.
- the `assemblies` directory will contain assemblies we make with our data (we might want to make more than one).
Head over to PacBio DevNet and fetch the E. coli C227-11 CLR corrected data set. Put it into the `origdata` directory created a few moments ago.
Now, let's extract the data to the `data` directory:

```
arcadia:c227-11-clrc$ cd data
arcadia:data$ tar xvzf ../origdata/e-coli-c227-11-corrected-fastq-1.2.2beta.tgz
e-coli-c227-11-corrected-fastq-1.2.2beta/
e-coli-c227-11-corrected-fastq-1.2.2beta/e-coli-c227-11-corrected.fastq
```
One thing you would quickly find out, but which I'll tell you now to save time: at the moment, PacBio seems to love ultra-long read names. Here are the first 10 from the current data set:
```
arcadia:data$ grep ^@m e-coli-c227-11-corrected-fastq-1.2.2beta/e-coli-c227-11-corrected.fastq | head -10
@m110618_035655_42142_c100158802555500000315044108071130_s1_p0/10040/0_5174/c0
@m110618_035655_42142_c100158802555500000315044108071130_s1_p0/10040/0_5174/c1
@m110618_035655_42142_c100158802555500000315044108071130_s1_p0/1017/0_1636/c0
@m110618_035655_42142_c100158802555500000315044108071130_s1_p0/1054/0_4073/c0
@m110618_035655_42142_c100158802555500000315044108071130_s1_p0/1054/0_4073/c1
@m110618_035655_42142_c100158802555500000315044108071130_s1_p0/1054/4121_4891/c0
@m110618_035655_42142_c100158802555500000315044108071130_s1_p0/10548/0_5766/c0
@m110618_035655_42142_c100158802555500000315044108071130_s1_p0/10548/0_5766/c1
@m110618_035655_42142_c100158802555500000315044108071130_s1_p0/10640/0_2393/c0
@m110618_035655_42142_c100158802555500000315044108071130_s1_p0/11000/0_3285/c0
```
How sweet! Read names with 80 and more characters AND containing the "/" character, the latter being a recipe for disaster sooner or later in some post-processing pipelines.
> **Note**
>
> MIRA has absolutely no problem with the above: neither with long read names nor with the "/" character in the name. However, long read names are a problem for, e.g., gap4 (an assembly viewer), and the "/" character might lead to confusion with the standard UNIX directory separator.
For the sake of simplicity and compatibility, let's rename all sequences. For this we'll use convert_project, which is a binary of the MIRA program package:
```
arcadia:data$ convert_project -f fastq -t fastq -R c227-11-clrc \
  e-coli-c227-11-corrected-fastq-1.2.2beta/e-coli-c227-11-corrected.fastq \
  c227-11-clrc_in.pacbio
```
```
Loading from fastq, saving to: fastq
Loading data from FASTQ ...
Counting sequences in FASTQ file: found 73496 sequences.
Localtime: Sat Aug 20 20:36:26 2011
Unusual offset of 34, guessing this file to be a Sanger-type FASTQ format.
Using calculated FASTQ quality offset: 33
Localtime: Sat Aug 20 20:36:26 2011
Loading data from FASTQ file:
[0%] ....|.... [10%] ....|.... [20%] ....|.... [30%] ....|.... [40%] ....|.... [50%] ....|.... [60%] ....|.... [70%] ....|.... [80%] ....|.... [90%] ....|.... [100%]
Done.
Loaded 73496 reads,
Localtime: Sat Aug 20 20:36:35 2011
done.
Data conversion process finished, no obvious errors encountered.
```
> **Note**
>
> The above command has been split in multiple lines for better overview but should be entered in one line.
The parameters to convert_project say:

- `-f fastq`: the "from" type (type of the input file) is FASTQ
- `-t fastq`: the "to" type (type of the output file) should be FASTQ
- `-R c227-11-clrc`: sequences should be renamed, all starting with "c227-11-clrc", to which convert_project will append an underscore and a counter.
- `e-coli-c227-11-corrected-fastq-1.2.2beta/e-coli-c227-11-corrected.fastq`: that's the full name of our input file.
- `c227-11-clrc_in.pacbio`: that's the partial name of our output file. Partial because convert_project will automatically add the postfix of the target format to the name, in this case `.fastq`.
Just in case you wonder why this works like this: convert_project can convert to multiple formats at once, e.g., `convert_project -f fastq -t fasta -t fastq -t maf ...`, and then you will get nicely named files.
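If you ever need to reproduce the renaming outside of MIRA, the behaviour described above (new base name plus underscore and counter) is simple to sketch. This is a hedged imitation, not convert_project's actual code, and it assumes plain 4-line FASTQ records; the exact counter format convert_project uses may differ:

```python
def rename_fastq_records(lines, newbase):
    """Rename FASTQ records to @newbase_<counter>, leaving
    sequence and quality lines untouched. Assumes simple
    4-line records (no wrapped sequences)."""
    out, counter = [], 0
    for i, line in enumerate(lines):
        if i % 4 == 0:                    # @name line
            counter += 1
            out.append('@{}_{}'.format(newbase, counter))
        elif i % 4 == 2:                  # '+' separator line
            out.append('+')
        else:                             # sequence / quality line
            out.append(line)
    return out

records = ['@m110618_035655/10040/0_5174/c0', 'ACGT', '+', 'IIII']
print(rename_fastq_records(records, 'c227-11-clrc'))
# ['@c227-11-clrc_1', 'ACGT', '+', 'IIII']
```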
Your directory should now look like this ...
```
arcadia:data$ ls -l
-rw-r--r-- 1 bach bach 257844076 2011-08-20 21:04 c227-11-clrc_in.pacbio.fastq
drwxr-x--- 2 bach bach      4096 2011-07-22 04:49 e-coli-c227-11-corrected-fastq-1.2.2beta
```
... and as we do not need the subdirectory with the extracted data from PacBio anymore, let's get rid of it:
```
arcadia:data$ rm -rf e-coli-c227-11-corrected-fastq-1.2.2beta
arcadia:data$ ls -l
-rw-r--r-- 1 bach bach 257844076 2011-08-20 21:04 c227-11-clrc_in.pacbio.fastq
```
Perfect, we're done here.
Good, we're almost there. Let's switch to the `assemblies` directory and create a subdirectory for our first assembly test.
```
arcadia:data$ cd ../assemblies/
arcadia:assemblies$ mkdir 1sttest
arcadia:assemblies$ cd 1sttest
```
This directory is quite empty and the PacBio data is not present. Let's link to the file we just created in the previous step:
```
arcadia:1sttest$ ln -s ../../data/* .
arcadia:1sttest$ ls -l
lrwxrwxrwx 1 bach bach 39 2011-08-20 16:56 c227-11-clrc_in.pacbio.fastq -> ../../data/c227-11-clrc_in.pacbio.fastq
```
Starting the assembly is now just a matter of one line with some parameters set correctly:
```
arcadia:1sttest$ mira --project=c227-11-clrc --job=denovo,genome,accurate,pacbio \
  >& log_assembly.txt
```
> **Note**
>
> The above command has been split in multiple lines for better overview but should be entered in one line.
Some 3 to 4 hours later, you should have a nice and shiny assembly of your data.
Now, that was easy, wasn't it? In the above example -- for assemblies having only PacBio data, and if you followed the walkthrough on how to prepare the data -- the only options you might want to adapt at first are the following:

- --project (for naming your assembly project)
- --job (perhaps to change the quality of the assembly to 'draft')
Of course, you are free to change any option via the extended parameters, e.g. change the default number of processors to use from 2 to 4 via [-GE:not=4], or any other of the > 150 parameters MIRA has ... but this is covered in the MIRA main reference manual.
There is a whole chapter in the manual dedicated to this, you are expected to read it :-)
However, for the impatient, here's a quick rundown on what I am going to show as example in this section: filtering of results and loading results into an assembly viewer.
This assembly project has an average coverage of roughly 23 to 24x. Due to sequencing errors, MIRA will also have created a few small contigs with just a few reads and much lower coverage. These contigs are usually not interesting as they mostly contain junk, and I want to get rid of them.
Let's say that only contigs ≥ 500 base pairs and with an average coverage ≥ 8 (which is 1/3 of the average coverage of 24 of the whole project) are interesting. Let's filter them out of the MIRA results and create a CAF file which can then be imported into either gap4 or gap5 (assembly viewers and finishing tools from the Staden package).
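The actual filtering is done by convert_project during the conversion (shown below). As an illustrative sanity check -- file name and values here are made up -- the same length and coverage thresholds can be applied to a contigstats-style table with awk:

```shell
# Mock contigstats-style table (columns as in MIRA's *_info_contigstats.txt:
# name, length, av.qual, #-reads, mx.cov., av.cov):
cat > contigstats_demo.txt <<'EOF'
# name	length	av.qual	#-reads	mx.cov.	av.cov
contig1	335375	90	4081	35	21.27
contig2	420	90	5	7	3.10
contig3	651370	90	9224	41	23.88
contig4	900	90	12	6	5.50
EOF

# Keep only contigs with length >= 500 and average coverage >= 8,
# mirroring convert_project's -x 500 -y 8 switches:
awk '!/^#/ && $2 >= 500 && $6 >= 8 {print $1}' contigstats_demo.txt
```

This prints contig1 and contig3: contig2 fails both thresholds, contig4 is long enough but too low in coverage.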
Let's take a quick look at the main directories and files of the assembly:
arcadia:1sttest$
ls -l
drwxr-xr-x 6 bach bach      4096 2011-08-20 16:57 c227-11-clrc_assembly
lrwxrwxrwx 1 bach bach        39 2011-08-20 16:56 c227-11-clrc_in.pacbio.fastq -> ../../data/c227-11-clrc_in.pacbio.fastq
-rw-r--r-- 1 bach bach   2524406 2011-08-20 19:46 log_assembly.txt
arcadia:1sttest$
cd c227-11-clrc_assembly
arcadia:c227-11-clrc_assembly$
ls -l
drwxr-xr-x 2 bach bach  4096 2011-08-20 17:01 c227-11-clrc_d_chkpt
drwxr-xr-x 2 bach bach  4096 2011-08-20 19:46 c227-11-clrc_d_info
drwxr-xr-x 2 bach bach  4096 2011-08-20 20:10 c227-11-clrc_d_results
drwxr-xr-x 2 bach bach 36864 2011-08-20 19:46 c227-11-clrc_d_tmp
arcadia:c227-11-clrc_assembly$
ls -l c227-11-clrc_d_info
-rw-r--r-- 1 bach bach     2320 2011-08-20 19:46 c227-11-clrc_info_assembly.txt
-rw-r--r-- 1 bach bach       88 2011-08-20 16:57 c227-11-clrc_info_callparameters.txt
-rw-r--r-- 1 bach bach   168049 2011-08-20 19:46 c227-11-clrc_info_consensustaglist.txt
-rw-r--r-- 1 bach bach  1599427 2011-08-20 19:46 c227-11-clrc_info_contigreadlist.txt
-rw-r--r-- 1 bach bach     6718 2011-08-20 19:46 c227-11-clrc_info_contigstats.txt
-rw-r--r-- 1 bach bach     7401 2011-08-20 19:46 c227-11-clrc_info_debrislist.txt
-rw-r--r-- 1 bach bach    10572 2011-08-20 19:15 c227-11-clrc_info_readrepeats.lst
-rw-r--r-- 1 bach bach 42697892 2011-08-20 19:46 c227-11-clrc_info_readtaglist.txt
arcadia:c227-11-clrc_assembly$
cd c227-11-clrc_d_results
arcadia:c227-11-clrc_results$
ls -l
-rw-r--r-- 1 bach bach 192615652 2011-08-20 19:46 c227-11-clrc_out.ace
-rw-r--r-- 1 bach bach 597368574 2011-08-20 19:46 c227-11-clrc_out.caf
-rw-r--r-- 1 bach bach 306918692 2011-08-20 19:46 c227-11-clrc_out.maf
-rw-r--r-- 1 bach bach   6333500 2011-08-20 19:46 c227-11-clrc_out.padded.fasta
-rw-r--r-- 1 bach bach  18987968 2011-08-20 19:46 c227-11-clrc_out.padded.fasta.qual
-rw-r--r-- 1 bach bach      7240 2011-08-20 19:46 c227-11-clrc_out.tcs
-rw-r--r-- 1 bach bach   6297565 2011-08-20 19:46 c227-11-clrc_out.unpadded.fasta
-rw-r--r-- 1 bach bach  18881604 2011-08-20 19:46 c227-11-clrc_out.unpadded.fasta.qual
-rw-r--r-- 1 bach bach   4442567 2011-08-20 19:46 c227-11-clrc_out.wig
OK, we're at the right spot for filtering. While we are at it, tell convert_project to not only convert to CAF, but also to write a new file with tabular data on contig statistics of the filtered contigs:
arcadia:c227-11-clrc_results$
convert_project -f maf -t caf -t cstats -x 500 -y 8 c227-11-clrc_out.maf c227-11-clrc_filteredx500y8
Loading from maf, saving to: caf cstats
First counting reads:
[0%] ....|.... [10%] ....|.... [20%] ....|.... [30%] ....|.... [40%] ....|.... [50%] ....|.... [60%] ....|.... [70%] ....|.... [80%] ....|.... [90%] ....|.... [100%]
Now loading and processing data:
... lots of lines omitted ...
Data conversion process finished, no obvious errors encountered.
arcadia:c227-11-clrc_results$
ls -l *filtered*
-rw-r--r-- 1 bach bach 584042411 2011-08-20 22:12 c227-11-clrc_filteredx500y8.caf
-rw-r--r-- 1 bach bach      1518 2011-08-20 22:12 c227-11-clrc_filteredx500y8_info_contigstats.txt
arcadia:c227-11-clrc_results$
cat c227-11-clrc_filteredx500y8_info_contigstats.txt
# name           length  av.qual  #-reads  mx.cov.  av.cov  GC%    CnIUPAC  CnFunny  CnN  CnX  CnGap  CnNoCov
c227-11-clrc_c1  335375  90       4081     35       21.27   49.73  2        0        0    0    1975   0
c227-11-clrc_c2  651370  90       9224     41       23.88   51.23  3        0        0    0    4535   0
c227-11-clrc_c3  356318  90       4962     40       24.18   50.81  0        0        0    0    2208   0
c227-11-clrc_c4  386288  90       5178     39       23.58   51.49  3        0        0    0    2367   0
c227-11-clrc_c5  908271  90       12277    40       23.41   50.73  3        0        0    0    5912   0
...
Once filtered, how many contigs are there? Well, 22: it's the number of lines of the file c227-11-clrc_filteredx500y8_info_contigstats.txt minus one (for the header line):
arcadia:c227-11-clrc_results$
wc -l c227-11-clrc_filteredx500y8_info_contigstats.txt
23
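The "line count minus header" trick is easy to script; a tiny self-contained illustration with a mock file (names and values are made up):

```shell
# Mock contigstats file: one '#' header line plus one line per contig
cat > contigstats_demo.txt <<'EOF'
# name	length	av.qual
contig1	335375	90
contig2	651370	90
contig3	356318	90
EOF

# Line count minus the header gives the number of contigs:
awk 'END{print NR - 1}' contigstats_demo.txt   # prints 3
```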
I'm very fond of gap4, so I'll use it to show how the assembly looks:
arcadia:c227-11-clrc_results$
caf2gap -project c227-11 -ace c227-11-clrc_filteredx500y8.caf >&/dev/null
arcadia:c227-11-clrc_results$
ls -l C*
-rw-r--r-- 1 bach bach 543539856 2011-08-20 22:35 C227-11.0
-rw-r--r-- 1 bach bach  39512896 2011-08-20 22:35 C227-11.0.aux
arcadia:c227-11-clrc_results$
gap4 C227-11.0
And while it's difficult to judge an assembly from a screenshot alone, I made one: 22 contigs, none smaller than 4kb, and pretty good certainty that your bases are correct. Pray tell, isn't that beautiful?
Head over to PacBio DevNet and fetch the E. coli C227-11 CCS data set.
For the rest ... well, it's pretty much the same as for the CLR data set. Just one little difference: in the .tgz you downloaded, PacBio has split the data set into multiple FASTQ files (for whatever reason). You will need to concatenate them into one file before starting to work with the data. Yep, and that's it.
Whole genome sequencing of bacteria will probably be amongst the first applications for which the long PacBio reads will have an impact. Simply put: the repeat structure -- like rRNA stretches, (pro)phages and/or duplicated genes/operons -- of bacteria is such that most genomes known so far can be assembled and/or scaffolded with paired-end libraries between 6Kb and 10Kb. Cite paper ...!
Well, using strobed reads -- where a DNA template is sequenced in several strobes and the dark inserts have approximately the same length as a strobe -- the initial PacBio technology should be capable of generating strobed data from DNA templates with a total span between 2000 and 6000 bases.
Furthermore, strobed reads can be used to generate traditional paired-end sequence with large insert sizes like 10Kb or more.
In the first few examples showing assembly with only PacBio data, we will use the genome of Bacillus subtilis 168, a long-standing model organism for systems biology which is also used in biotechnology. From a complexity point of view, the genome has some interesting features. For example, there are 11 rRNA stretches, some of them clustered together, which probably comes from the fact that Bsub evolved under laboratory conditions to become a fast grower. The most awful multiple rRNA cluster is the one starting at ...Kb and is ... Kb long.
In the examples afterwards we will work with Escherichia coli ... (Eco), another model organism of the bacterial community. This time we will mix simulated low-coverage PacBio data with real Solexa data deposited at the NCBI Short Read Archive (SRA).
Note: Currently this section contains examples with real Solexa reads but only simulated PacBio reads, as I do not have early access to real PacBio data. However, I think these examples show the possibilities such a technology could have.
Everyone (or every sequencing group / center) has more or less their own standard on how to organise directories and data prior to an assembly. Here's how I normally do it and how the following examples will be -- more or less -- set up: one top directory with the name of the project containing three specific sub-directories; one for the original data, one for possibly reformatted data and one for assemblies. That looks a bit like this:
$
mkdir myproject
$
cd myproject
myproject$
mkdir origdata data assemblies
The origdata directory contains whatever data files (or links to those) I have for that project: sequencing files from the provider, reference genomes from databases etc. The general rule: no other files, and these files are generally write protected and kept unchanged from the state of delivery.
The data directory contains the files as MIRA will want to use them, possibly reformatted, reworked or bundled together with other data. E.g.: if your provider delivered several files with PacBio sequence data, you currently need to combine them into one file, as MIRA currently reads only one input file per sequencing technology.
The assemblies directory finally contains sub-directories with the different assembly trials I make. Every sub-directory is quickly set up by creating it, linking data files from the data directory into it and then starting MIRA. Continuing the example from above:
myproject$
cd assemblies
myproject/assemblies$
mkdir firstassembly
myproject/assemblies$
cd firstassembly
myproject/assemblies/firstassembly$
lndir ../../data
myproject/assemblies/firstassembly$
mira --project=...
That strategy keeps things nice and tidy in place and allows for a maximum flexibility while testing out a couple of settings.
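The skeleton described above is easily scripted; a minimal sketch (the project name is just an example):

```shell
# Create the three-directory skeleton for a new assembly project:
proj=myproject
mkdir -p "$proj/origdata" "$proj/data" "$proj/assemblies"

ls "$proj"   # lists: assemblies, data, origdata
```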
Set up directories and fetch genome of Bacillus subtilis 168 from GenBank
$
mkdir bsubdemo1
$
cd bsubdemo1
bsubdemo1$
mkdir origdata data assemblies
bsubdemo1$
cd origdata
bsubdemo1/origdata$
wget ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Bacillus_subtilis/AL009126.fna
bsubdemo1/origdata$
ls -l
-rw-r--r-- 1 bach bach 4275918 2010-06-06 00:34 AL009126.fna
After that, we'll prepare the simulated PacBio data by running a script which creates paired reads as we would expect them from PacBio sequencing, with the following properties: DNA templates have 10k bases or more; we sequence the first 1000 bases in a strobe, let approximately 8000 bases pass, then sequence another 1000 bases.
bsubdemo1/origdata$
cd ../data
bsubdemo1/data$
fasta2frag.tcl -l 1000 -i 230 -p 1 -insert_size 10000 -pairednaming 454 -P 0 -r 2 -infile ../origdata/AL009126.fna -outfile bs168pe1k_10k_in.pacbio.fasta
no ../origdata/AL009126.fna.qual
fragging gi|225184640|emb|AL009126.3|
bsubdemo1/data$
ls -l
-rw-r--r-- 1 bach bach  38971051 2010-06-06 18:27 bs168pe1k_10k_in.pacbio.fasta
-rw-r--r-- 1 bach bach   1642208 2010-06-06 18:27 bs168pe1k_10k_in.pacbio.fasta.bambus
-rw-r--r-- 1 bach bach   1440988 2010-06-06 18:27 bs168pe1k_10k_in.pacbio.fasta.pairs
-rw-r--r-- 1 bach bach 111214374 2010-06-06 18:27 bs168pe1k_10k_in.pacbio.fasta.qual
The command line given above will create an artificial data set with equally distributed PacBio "paired-end" reads at an average coverage of 8.6 across the genome. Note that the *.bambus and *.pairs files are not needed by MIRA; the Tcl script generates these for some other use cases.
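A back-of-the-envelope check of that coverage figure, assuming each 230-base step of the template start (-i 230) contributes two 1000-base reads (-l 1000, one pair per template) -- this reading of the fasta2frag.tcl parameters is an assumption:

```shell
# Expected coverage ~= (bases contributed per template) / (step size):
awk 'BEGIN{printf "%.1f\n", 2 * 1000 / 230}'   # prints 8.7
```

That is in the same ballpark as the quoted 8.6x; the small difference comes from edge effects at the genome ends.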
Next, we move to the assembly directory, make a new one to run a first assembly and link all the needed input files for MIRA into this new directory:
bsubdemo1/data$
cd ../assemblies
bsubdemo1/assemblies$
mkdir firsttest
bsubdemo1/assemblies$
cd firsttest
bsubdemo1/assemblies/firsttest$
ln -s ../../data/bs168pe1k_10k_in.pacbio.fasta .
bsubdemo1/assemblies/firsttest$
ln -s ../../data/bs168pe1k_10k_in.pacbio.fasta.qual .
bsubdemo1/assemblies/firsttest$
ls -l
lrwxrwxrwx 1 bach bach 39 2010-06-06 01:01 bs168pe1k_10k_in.pacbio.fasta -> ../../data/bs168pe1k_10k_in.pacbio.fasta
lrwxrwxrwx 1 bach bach 44 2010-06-06 01:01 bs168pe1k_10k_in.pacbio.fasta.qual -> ../../data/bs168pe1k_10k_in.pacbio.fasta.qual
We're all set up now, just need to start the assembly:
bsubdemo1/assemblies/firsttest$
mira --project=bs168pe1k_10k --job=genome,denovo,accurate,pacbio --notraceinfo -GE:not=4 PACBIO_SETTINGS -GE:tpbd=1:tismin=9000:tismax=11000 -LR:rns=fr >&log_assembly.txt
The command above told MIRA:

- the name (bs168pe1k_10k) you chose for your project. MIRA will search for input files with this prefix as well as write output files and directories with that prefix;
- the assembly job MIRA should perform, in this case a de-novo genome assembly at accurate level with PacBio data;
- that MIRA should not search for additional ancillary information in NCBI TRACEINFO XML files;
- the number of threads which MIRA should run at most in parallel;
- then, that the following switches apply to reads in the assembly which are from Pacific Biosciences;
- that both reads of a PacBio read-pair should assemble in the same direction in a contig, and that the distance between the outer read ends should be at minimum 9000 bases and at maximum 11000 bases;
- that the read naming scheme for a PacBio read-pair is "forward/reverse", i.e., the first read has ".f" appended to its name, the second read ".r";
- that the standard output of MIRA should be redirected to a file named log_assembly.txt.
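The "fr" naming scheme (-LR:rns=fr) can be illustrated with a few mock read names: stripping the .f/.r suffix recovers the template name shared by both reads of a pair.

```shell
# Mock read names following the forward/reverse naming scheme
# (the names themselves are made up for this demo):
printf '%s\n' tmpl_001.f tmpl_001.r tmpl_002.f tmpl_002.r > readnames_demo.txt

# Strip the .f/.r suffix to get the per-pair template names:
sed 's/\.[fr]$//' readnames_demo.txt | sort -u   # tmpl_001, tmpl_002
```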
Some 12 to 13 minutes later, the data set will be assembled, though you should note that in real-life projects with sequencing errors, MIRA will take perhaps 3 to 4 times longer. Have a look at the information files in the directory bs168pe1k_10k_assembly/bs168pe1k_10k_d_info/, especially at the files bs168pe1k_10k_info_assembly.txt and bs168pe1k_10k_info_contigstats.txt, which give a first overview of how the assembly went.
In short: this assembly went -- unsurprisingly -- quite well: the complete chromosome of Bacillus subtilis 168 has been reconstructed into one contig. There are just two minor flaws. First, a few (twelve) repetitive reads could not be placed and form a second small contig of 2Kb. Second, the reconstructed chromosome contains 4 single-base differences with respect to the original Bsub chromosome. It is left as an exercise to the reader to find out that this is due to almost identical rRNA repeats, where two almost adjacent elements lie within the expected template insert size of the simulated PacBio reads and therefore troubled the assembler a bit.
Your next stop would then be the directory bs168pe1k_10k_assembly/bs168pe1k_10k_d_results/ which contains the assembly results in all kinds of formats. If a format you need is missing, have a look at convert_project from the MIRA package; it may be that the format you need can be generated with it.
Set up directories and fetch genome of Bacillus subtilis 168 from GenBank
$
mkdir bsubdemo2
$
cd bsubdemo2
bsubdemo2$
mkdir origdata data assemblies
bsubdemo2$
cd origdata
bsubdemo2/origdata$
wget ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Bacillus_subtilis/AL009126.fna
bsubdemo2/origdata$
ls -l
-rw-r--r-- 1 bach bach 4275918 2010-06-06 00:34 AL009126.fna
After that, we'll prepare the simulated PacBio data by running a script which creates strobed reads as we would expect them from PacBio sequencing, with the following properties: DNA templates are 6k bases or more; we sequence the first ~100 bases in a strobe, let approximately 100 bases pass, and repeat until we have 3000 bases in strobes.
bsubdemo2/origdata$
cd ../data
bsubdemo2/data$
fasta2frag.tcl -l 3000 -i 150 -r 2 -s 1 -strobeon 100 -strobeoff 100 -infile ../origdata/AL009126.fna -outfile bs168_3ks_100_100_in.pacbio.fasta
no ../origdata/AL009126.fna.qual
fragging gi|225184640|emb|AL009126.3|
bsubdemo2/data$
ls -l
-rw-r--r-- 1 bach bach 166909136 2010-06-06 19:18 bs168_3ks_100_100_in.pacbio.fasta
-rw-r--r-- 1 bach bach 416614472 2010-06-06 19:18 bs168_3ks_100_100_in.pacbio.fasta.qual
The command line given above will create an artificial data set with equally distributed PacBio strobed reads with an average coverage of ~20 across the genome, of which only half is filled with sequence data, so the "real" coverage is ~10.
Next, we move to the assembly directory, make a new one to run a first assembly and link all the needed input files for MIRA into this new directory:
bsubdemo2/data$
cd ../assemblies
bsubdemo2/assemblies$
mkdir firsttest
bsubdemo2/assemblies$
cd firsttest
bsubdemo2/assemblies/firsttest$
ln -s ../../data/bs168_3ks_100_100_in.pacbio.fasta .
bsubdemo2/assemblies/firsttest$
ln -s ../../data/bs168_3ks_100_100_in.pacbio.fasta.qual .
bsubdemo2/assemblies/firsttest$
ls -l
lrwxrwxrwx 1 bach bach 39 2010-06-06 01:01 bs168_3ks_100_100_in.pacbio.fasta -> ../../data/bs168_3ks_100_100_in.pacbio.fasta
lrwxrwxrwx 1 bach bach 44 2010-06-06 01:01 bs168_3ks_100_100_in.pacbio.fasta.qual -> ../../data/bs168_3ks_100_100_in.pacbio.fasta.qual
We're all set up now, just need to start the assembly:
bsubdemo2/assemblies/firsttest$
mira --project=bs168_3ks_100_100 --job=genome,denovo,accurate,pacbio --notraceinfo --noclipping -GE:not=4 -GO:mr=no PACBIO_SETTINGS -AL:egp=no >&log_assembly.txt
The command above told MIRA:

- the name (bs168_3ks_100_100) you chose for your project. MIRA will search for input files with this prefix as well as write output files and directories with that prefix;
- the assembly job MIRA should perform, in this case a de-novo genome assembly at accurate level with PacBio data;
- that MIRA should not search for additional ancillary information in NCBI TRACEINFO XML files;
- the number of threads which MIRA should run at most in parallel;
- that the MIRA parameter called "mark repeats" should be switched off. This is absolutely necessary when you have strobed reads with elastic dark inserts, as MIRA otherwise gets somewhat confused due to the alignment problems shown earlier in this guide;
- then, that the following switches apply to reads in the assembly which are from Pacific Biosciences;
- that the MIRA parameter called "extra gap penalty" should be switched off for PacBio reads. This is necessary when you have strobed reads with elastic dark inserts, as otherwise alignment problems with larger gaps lead to unnecessary rejection of alignments;
- that the standard output of MIRA should be redirected to a file named log_assembly.txt.
Wait approximately 4.5 hrs for MIRA to complete. Using elastic dark inserts is a pretty expensive feature from a computational perspective: all the passes and sub-passes MIRA needs to move from an estimated length to an actually correct value mean building and breaking apart all the contigs and starting anew.
Bad news first: looking at the results and info directories, you will see that one single contig with a length of 4199898 bases was created. The original B. subtilis genome we used for this walkthrough is 4215426 bases long, so it looks like some 15.5Kb are "missing." But, and this is the good news, the contig which was created represents the B. subtilis genome pretty faithfully: a check with MUMMER confirms that no misassemblies or re-ordering events of genome elements occurred.
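The quoted 15.5Kb is simply the difference between the reference length and the assembled contig length:

```shell
# Reference B. subtilis genome length minus assembled contig length:
awk 'BEGIN{print 4215426 - 4199898}'   # prints 15528, i.e. ~15.5Kb
```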
Note: The following needs MUMMER3 installed on your system. Fetch it here: http://mummer.sourceforge.net/
bsubdemo2/assemblies/firsttest$
cd bs168_3ks_100_100_assembly/bs168_3ks_100_100_d_results
../bs168_3ks_100_100_d_results$
ls -l
-rw-r--r-- 1 bach bach 280894715 2010-06-08 04:53 bs168_3ks_100_100_out.ace
-rw-r--r-- 1 bach bach 776536315 2010-06-08 04:52 bs168_3ks_100_100_out.caf
-rw-r--r-- 1 bach bach 461365272 2010-06-08 04:52 bs168_3ks_100_100_out.maf
-rw-r--r-- 1 bach bach   4347658 2010-06-08 04:52 bs168_3ks_100_100_out.padded.fasta
-rw-r--r-- 1 bach bach  13040564 2010-06-08 04:52 bs168_3ks_100_100_out.padded.fasta.qual
-rw-r--r-- 1 bach bach 436189259 2010-06-08 04:53 bs168_3ks_100_100_out.tcs
-rw-r--r-- 1 bach bach   4269919 2010-06-08 04:52 bs168_3ks_100_100_out.unpadded.fasta
-rw-r--r-- 1 bach bach  12808422 2010-06-08 04:52 bs168_3ks_100_100_out.unpadded.fasta.qual
-rw-r--r-- 1 bach bach   3203036 2010-06-08 04:53 bs168_3ks_100_100_out.wig
../bs168_3ks_100_100_d_results$
nucmer -maxmatch -c 100 -p nucmer ../../../../origdata/AL009126.fna bs168_3ks_100_100_out.unpadded.fasta
1: PREPARING DATA
2,3: RUNNING mummer AND CREATING CLUSTERS
[... some lines left out ...]
4: FINISHING DATA
../bs168_3ks_100_100_d_results$
delta-filter -q -l 1000 nucmer.delta > nucmer.delta.q
../bs168_3ks_100_100_d_results$
show-coords -r -c -l nucmer.delta.q > nucmer.coords
../bs168_3ks_100_100_d_results$
cat nucmer.coords
NUCMER

    [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
===============================================================================================================================
     181  4215606  |  4199698        1  |  4215426  4199698  |    99.62  |  4215606  4199898  |   100.00   100.00  | gi|225184640|emb|AL009126.3|  bs168_3ks_100_100_c1
As already said: not 100% perfect on a base-by-base basis, but good enough for use as a reference sequence in subsequent mapping assemblies to get all the bases right.
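For scripted checks, the %-identity column of a show-coords data line can be pulled out with awk; a sketch on a line like the one shown above (the whitespace layout and the trailing tag names are assumed):

```shell
# A show-coords data line; fields are separated by '|'.
# Pipes inside the trailing [TAGS] field do not affect field 4,
# since awk numbers fields left to right:
line='181 4215606 | 4199698 1 | 4215426 4199698 | 99.62 | 4215606 4199898 | 100.00 100.00 | refname qryname'

echo "$line" | awk -F'|' '{gsub(/[[:space:]]/, "", $4); print $4}'   # 99.62
```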
TO BE EXPANDED: no real walkthrough yet, just a few hints.
Prepare your PacBio data like explained in this guide.
Prepare your other data (Sanger, 454, Solexa, or any combination of it) like explained in the respective MIRA guides.
Start MIRA with, e.g., --job=denovo,genome,accurate,pacbio,solexa (and any other parameter you need) for a de-novo genome assembly at accurate level with PacBio and Solexa data.
For error-corrected CLR data, MIRA does not get the average coverage of a project right: it underestimates it, sometimes by a factor of 10. This in turn leads to too many "large" contigs and subsequently to an N50 number which is way off the truth.
Will be fixed asap.
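For reference, the N50 mentioned above is the contig length at which the cumulative sum of contig lengths, sorted in descending order, first reaches half the total assembly size. A small self-contained sketch with made-up lengths:

```shell
# Mock contig lengths (total 10000, half 5000):
printf '%s\n' 4000 2500 2000 1000 500 > lengths_demo.txt

# Sort descending, accumulate, report the length reaching half the total:
sort -rn lengths_demo.txt | awk '
  { len[NR] = $1; total += $1 }
  END {
    run = 0
    for (i = 1; i <= NR; i++) {
      run += len[i]
      if (run >= total / 2) { print len[i]; exit }
    }
  }'   # prints 2500
```

Here 4000 alone is below half the total (5000), and 4000+2500 reaches it, so N50 is 2500.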