“Upset causes changes. Change causes upset.”
--Solomon Short
MIRA can assemble 454 type data either on its own or together with Sanger or Solexa type sequencing data (true hybrid assembly). Paired-end sequences coming from genomic projects can also be used if you take care to prepare your data the way MIRA needs it.
MIRA goes a long way to assemble sequence in the best possible way: it uses multiple passes, learning in each pass from errors that occurred in the previous passes. There are routines specialised in handling oddities that occur in different sequencing technologies.
This guide assumes that you have basic working knowledge of Unix systems, know the basic principles of sequencing (and sequence assembly) and what assemblers do.
While there are step-by-step walkthroughs on how to set up your 454 data and then perform an assembly, this guide expects you to read, at some point in time:

- the "Caveats when using 454 data" section of this document (just below). This. Is. Important. Read. It!
- the mira_usage introductory help file, so that you have a basic knowledge of how to set up projects in MIRA for Sanger sequencing projects.
- the GS FLX Data Processing Software Manual from Roche Diagnostics (or the corresponding manual for the GS20 or Titanium instruments).
- and last but not least, the mira_reference help file to look up some command line options.
If you want to jump into action, I suggest you walk through the "Walkthrough: combined unpaired and paired-end assembly of Brucella ceti" section of this document to get a feeling on how things work. That particular walkthrough is with paired and unpaired 454 data from the NCBI short read archive, so be prepared to download a couple of hundred MiBs.
But please do not forget to come back to the "Caveats" section just below later; it contains pointers to common traps lurking in the depths of high-throughput sequencing.
Please take some time to read this section. If you're really eager to jump into action, then feel free to skip forward to the walkthrough, but make sure to come back later.
Or at least use the vector clipping info provided in the SFF file and have it put into a standard NCBI TRACEINFO XML format. Yes, that's right: vector clipping info.
Here's the short story: 454 reads can contain a kind of vector sequence. To be more precise, they can - and very often do - contain the sequence of the (A or B)-adaptors that were used for sequencing.
To quote a competent bioinformatician who thankfully dug through quite some data and patent filings to find out what is going on: "These adaptors consist of a PCR primer, a sequencing primer and a key. The B-adaptor is always in because it's needed for the emPCR and sequencing. If the fragments are long enough, then one usually does not reach the adaptor at all. But if the fragments are too short - tough luck."
Basically it's tough luck for a lot of 454 sequencing projects I have seen so far, both for public data (sequences available at the NCBI trace archive) and non-public data.
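If you suspect adaptor leftovers in your own extracted data, a crude first check with standard Unix tools will tell you how many reads still contain a given adaptor fragment. A minimal sketch on a two-read toy FASTA; the file name and the adaptor fragment below are made-up placeholders, not the real Roche adaptor sequences (ask your provider for those):

```shell
# Toy FASTA standing in for extracted 454 reads (file name is invented).
printf '>r1\nACGTGTTGGAACCGA\n>r2\nTTTTTTTT\n' > reads_demo.fasta

# Count reads containing a suspected adaptor fragment. The sequence here
# is a placeholder -- substitute the adaptor your provider actually used.
ADAPTOR=GTTGGAACC
grep -v '^>' reads_demo.fasta | grep -c "$ADAPTOR"
```

If a large fraction of reads matches, the adaptor clipping in the SFF is probably incomplete.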
Tip: Use the sff_extract script from Jose Blanca at the University of Valencia. The home of sff_extract is http://bioinf.comav.upv.es/sff_extract/index.html, but I am thankful to Jose for giving permission to distribute the script in the MIRA 3rd party package (separate download).
Some labs use specially designed tags for their sequencing (I've heard of cases with up to 20 bases). As the tag sequences are identical across reads, they will behave like vector sequences in an assembly. As with any other assembler: if you happen to get such a project, you must take care that those tags are filtered out or masked from your sequences before going into an assembly. If you don't, the results will be messy at best.
Tip: Put your FASTAs through SSAHA2 or, better, SMALT with the sequence of your tags as masking target. MIRA can read the SSAHA2 output (or SMALT output when using "-f ssaha"), and mask internally using the MIRA [-CL:msvs] parameter and the options pertaining to it.
Sequences coming from the GS20, FLX or Titanium usually have pretty good clip points set by the Roche/454 preprocessing software. There is, however, a tendency to overestimate the quality towards the end of the sequences and declare sequence parts as 'good' which really shouldn't be.
Sometimes, these bad parts toward the end of sequences are so annoyingly bad that they prevent MIRA from correctly building contigs; that is, instead of one contig you might get two.
MIRA has the [-CL:pec] clipping option to deal with these annoyances (standard for all --job=genome assemblies). This algorithm performs proposed end clipping, which guarantees that the ends of reads are clean when the coverage of a project is high enough.
For genomic sequences, the term 'enough' is somewhat fuzzy: everything above a coverage of 15x should be no problem at all, and coverages above 10x should also be fine. Things start to get tricky below 10x, but give it a try. Below 6x, however, switch off the [-CL:pec] option.
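To apply that rule of thumb you need a rough estimate of your expected coverage: total sequenced bases divided by the genome size. A back-of-the-envelope sketch; all the numbers below are assumptions for illustration (1.2 million FLX reads of around 250 bases against a 4.5 Mb genome), substitute your own:

```shell
# Estimated coverage = (number of reads * mean read length) / genome size.
awk 'BEGIN { nreads=1200000; readlen=250; genome=4500000;
             printf "estimated coverage: %.1fx\n", nreads*readlen/genome }'
```

Anything this comfortably above 15x can keep [-CL:pec] switched on.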
"Do I have enough memory?" has been one of the most often asked question in former times. To answer this question, please use miramem which will give you an estimate. Basically, you just need to start the program and answer the questions, for more information please refer to the corresponding section in the main MIRA documentation.
Take this estimate with a grain of salt, depending on the sequences properties, variations in the estimate can be +/- 30%.
Take these estimates with an even larger grain of salt for eukaryotes. Some of them are incredibly repetitive, which currently makes some secondary tables in MIRA explode in size. I'm working on it.
The basic data type you will get from the sequencing instruments will be SFF files. Those files contain almost all information needed for an assembly, but they need to be converted into more standard files before mira can use this information.
Let's assume we just sequenced a bug (Bacillus chocorafoliensis) and that our department internally uses the short mnemonic bchoc for this project/organism/whatever. So, whenever you see bchoc in the following text, you can replace it by whatever name suits you.
For this example, we will assume that you have created a directory myProject for the data of your project and that the SFF files are in there. Doing a ls -lR should give you something like this:
arcadia:/path/to/myProject$
ls -lR
-rw-rw-rw- 1 bach users 475849664 2007-09-23 10:10 EV10YMP01.sff
-rw-rw-rw- 1 bach users 452630172 2007-09-25 08:59 EV5RTWS01.sff
-rw-rw-rw- 1 bach users 436489612 2007-09-21 08:39 EVX95GF02.sff
As you can see, this sequencing project has 3 SFF files.
As mentioned above, the SFF files need to be converted into more standard files before MIRA can use the information they contain.
In former times this was done using 3 files (FASTA, FASTA quality and XML), but nowadays the FASTQ format is used almost everywhere, so we will need only two files: FASTQ for sequence + quality and XML for clipping information.
We'll use the sff_extract script to do that. We'll name the output files in a way that makes them immediately suitable for MIRA input.
Note 1: make sure you have Python installed on your system
Note 2: make sure you have the sff_extract script in your path (or use absolute path names)
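Both prerequisites can be checked quickly from the shell before you start. A small sketch that only reports and changes nothing (note that on modern systems the interpreter may be installed as python3 rather than python):

```shell
# Check that the tools this walkthrough relies on are reachable via $PATH.
# sff_extract may legitimately show as MISSING if you have not installed it.
for tool in python sff_extract; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: found"
    else
        echo "$tool: MISSING"
    fi
done
```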
arcadia:/path/to/myProject$
sff_extract -Q -s bchoc_in.454.fastq -x bchoc_traceinfo_in.454.xml EV10YMP01.sff EV5RTWS01.sff EVX95GF02.sff
Note: The above command may have been split across multiple lines for better overview, but it should be entered as one line.
The parameters to sff_extract tell it to extract to FASTQ (via -Q), give the FASTQ file a name we chose (via -s), give the XML file with clipping information a name we chose (via -x), and convert the SFFs named EV10YMP01.sff, EV5RTWS01.sff and EVX95GF02.sff.
This can take some time; the 1.2 million FLX reads from this example need approximately 9 minutes for conversion. Your directory should now look something like this:
arcadia:/path/to/myProject$
ls -l
-rw-r--r-- 1 bach users 231698898 2007-10-21 15:16 bchoc_in.454.fastq
-rw-r--r-- 1 bach users 193962260 2007-10-21 15:16 bchoc_traceinfo_in.454.xml
-rw-rw-rw- 1 bach users 475849664 2007-09-23 10:10 EV10YMP01.sff
-rw-rw-rw- 1 bach users 452630172 2007-09-25 08:59 EV5RTWS01.sff
-rw-rw-rw- 1 bach users 436489612 2007-09-21 08:39 EVX95GF02.sff
At this point, the SFFs are not needed anymore. You can remove them from this directory if you want.
Starting the assembly is now just a matter of one line with some parameters set correctly:
arcadia:/path/to/myProject$
mira --project=bchoc --job=denovo,genome,accurate,454 >&log_assembly.txt
Note: The above command may have been split across multiple lines for better overview, but it should be entered as one line.
Now, that was easy, wasn't it? In the above example - for assemblies having only 454 data, and if you followed the walkthrough on how to prepare the data - the only options you might want to adapt at first are the following:
- --project (for naming your assembly project)
- --job (perhaps to change the quality of the assembly to 'draft')
Of course, you are free to change any option via the extended parameters, but this is covered in the MIRA main reference manual.
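For instance, a hypothetical draft-quality variant of the same assembly would look like this; only the quality level in the --job string changes, everything else stays as before:

```shell
mira --project=bchoc --job=denovo,genome,draft,454 >&log_assembly.txt
```

Draft assemblies run faster at the cost of some accuracy, which can be handy for a first look at a new data set.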
Preparing the data for a Sanger / 454 hybrid assembly takes some more steps but is not really more complicated than a normal Sanger-only or 454-only assembly.
In the following sections, the example project is named bchoc_hyb, simply for us to remember that we did a hybrid assembly there.
Files with 454 input data will have .454. in the name; files with Sanger data will have .sanger. in the name.
Please proceed exactly in the same way as described for the assembly of 454-only data in the section above, that is, without starting the actual assembly.
In the end you should have two files (FASTQ and TRACEINFO) for the 454 data ready.
There are quite a number of sequencing providers out there, all with different pre-processing pipelines and different output file-types. MIRA supports quite a number of them, the three most important would probably be
- (preferred option) FASTQ files and ancillary data in NCBI TRACEINFO XML format.
- (preferred option) FASTA files which are coupled with FASTA quality files and ancillary data in NCBI TRACEINFO XML format.
- (preferred option) CAF (from the Sanger Institute) files that contain the sequence, quality values and ancillary data like clippings etc.
- (secondary option, not recommended) EXP files as written by the Staden pregap4 package.
Your sequencing provider MUST have performed at least a sequencing vector clip on this data. A quality clip might also best be done by the provider, as they usually know best what quality to expect from their instruments (although MIRA can also do this if you want).
You can either perform clipping the hard way by physically removing all clipped bases from the input (this is called trimming), or you can keep the clipped bases in the input file and provide clipping information in ancillary data files. This clipping information then MUST be present in the ancillary data (either the TRACEINFO XML, or in the combined CAF, or in the EXP files), together with other standard data like, e.g., mate-pair information when using a paired-end approach.
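For illustration, here is roughly what a single entry in a TRACEINFO XML file looks like. The read name, template id and all numeric values are invented for this sketch; check the NCBI trace archive documentation for the authoritative list of fields:

```xml
<?xml version="1.0"?>
<trace_volume>
  <trace>
    <trace_name>bchoc_s001</trace_name>
    <clip_vector_left>24</clip_vector_left>
    <clip_quality_right>645</clip_quality_right>
    <insert_size>3000</insert_size>
    <insert_stdev>900</insert_stdev>
    <template_id>bchoc_t001</template_id>
    <trace_end>F</trace_end>
  </trace>
</trace_volume>
```

The clip_* fields carry the clipping information discussed above; insert_size, insert_stdev and template_id carry the mate-pair information for paired-end approaches.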
This example assumes that the data is provided as FASTA together with a quality file and ancillary data in NCBI TRACEINFO XML format.
Put these files (appropriately renamed) into the directory with the 454 data.
Here's how the directory with the preprocessed data should now look (note that we changed the bchoc mnemonic to bchoc_hyb just for fun ... and to distinguish it from the 454-only assembly above):
arcadia:/path/to/myProject$
ls -l
-rwxrwxrwx 1 bach 2007-10-13 22:44 bchoc_hyb_in.454.fastq
-rwxrwxrwx 1 bach 2007-10-13 22:44 bchoc_hyb_traceinfo_in.454.xml
-rwxrwxrwx 1 bach 2007-10-13 22:44 bchoc_hyb_in.sanger.fasta
-rwxrwxrwx 1 bach 2007-10-13 22:44 bchoc_hyb_in.sanger.fasta.qual
-rwxrwxrwx 1 bach 2007-10-13 22:44 bchoc_hyb_traceinfo_in.sanger.xml
The following command line starts a basic, but normally quite respectable, hybrid 454 and Sanger assembly of a genome where the 454 data has FASTQ + TRACEINFO XML and the Sanger data has FASTA, FASTA quality + TRACEINFO XML as input types:
arcadia:/path/to/myProject$
mira --project=bchoc_hyb --job=denovo,genome,accurate,sanger,454 SANGER_SETTINGS -LR:ft=fasta:mxti=yes >& log_assembly.txt
The only changes compared to starting an assembly with only 454 data were adding "sanger" to the --job= option and telling MIRA that the Sanger data needs to be loaded from FASTA (+ quality) and that ancillary information must be merged from the TRACEINFO file.
Here's a walkthrough which should help you in setting up your own assemblies. You do not need to set up your directory structures as I do, but for this walkthrough it could help.
Note: This walkthrough was written at a time when the NCBI still offered SFFs for 454 data, which it now does not anymore. However, the approach is still valid for your own data, for which you should get SFFs.
Note: This walkthrough was written at a time when the primary input for 454 data in MIRA was FASTA + FASTA quality files. This has nowadays shifted to FASTQ as input (it's more compact and faster to parse). I'm sure you will be able to make the necessary changes to the command line of sff_extract yourself :-)
Please make sure that sff_extract is working properly and that you have at least version 0.2.1 (use sff_extract -v). Please also make sure that SSAHA2 can be run correctly (test this by running ssaha2 -v).
Note: this is how I set up a project, feel free to implement whatever structure suits your needs.
$
mkdir bceti
$
cd bceti
arcadia:bceti$
mkdir origdata data assemblies
Your directory should now look like this:
arcadia:bceti$
ls -l
drwxr-xr-x 2 bach users 48 2008-11-08 16:51 assemblies
drwxr-xr-x 2 bach users 48 2008-11-08 16:51 data
drwxr-xr-x 2 bach users 48 2008-11-08 16:51 origdata
Explanation of the structure:

- the origdata directory will contain the 'raw' result files that one might get from sequencing.
- the data directory will contain the preprocessed sequences for the assembly, ready to be used by MIRA.
- the assemblies directory will contain assemblies we make with our data (we might want to make more than one).
Note: Since early summer 2009, the NCBI does not offer SFF files anymore, which is a pity. This guide will nevertheless allow you to perform similar assemblies on your own data.
Please browse to http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?run=SRR005481&cmd=viewer&m=data&s=viewer and http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?run=SRR005482&cmd=viewer&m=data&s=viewer and download the SFF files to the origdata directory (press the download button on those pages).
En passant, note the following: SRR005481 is described as a 454 FLX data set where the library contains unpaired data ("Library Layout: SINGLE"). SRR005482 also has 454 FLX data, but this time it's paired-end data ("Library Layout: PAIRED (ORIENTATION=forward)"). Knowing this will be important later on in the process.
arcadia:bceti$
cd origdata
arcadia:origdata$
ls -l
-rw-r--r-- 1 bach users 240204619 2008-11-08 16:49 SRR005481.sff.gz
-rw-r--r-- 1 bach users 211333635 2008-11-08 16:55 SRR005482.sff.gz
We need to unzip those files:
arcadia:bceti_assembly/origdata$
gunzip *.gz
And now this directory should look like this
arcadia:bceti_assembly/origdata$
ls -l
-rw-r--r-- 1 bach users 544623256 2008-11-08 16:49 SRR005481.sff
-rw-r--r-- 1 bach users 476632488 2008-11-08 16:55 SRR005482.sff
Now move into the (still empty) data directory:
arcadia:origdata$
cd ../data
We will first extract the data from the unpaired experiment (SRR005481); the generated file names should all start with bceti:
arcadia:bceti_assembly/data$
sff_extract -o bceti ../origdata/SRR005481.sff
Working on '../origdata/SRR005481.sff':
Converting '../origdata/SRR005481.sff' ... done.
Converted 311201 reads into 311201 sequences.

********************************************************************************
WARNING: weird sequences in file ../origdata/SRR005481.sff

After applying left clips, 307639 sequences (=99%) start with these bases:
TCTCCGTC

This does not look sane.

Countermeasures you *probably* must take:
 1) Make your sequence provider aware of that problem and ask whether this
    can be corrected in the SFF.
 2) If you decide that this is not normal and your sequence provider does
    not react, use the --min_left_clip of sff_extract.
    (Probably '--min_left_clip=13' but you should cross-check that)
********************************************************************************
(Note: I got this on the SRR005481 data set downloaded in October 2008. In the meantime, the sequencing center or the NCBI may have corrected the error.)
Wait a minute ... what happened here?
We launched a pretty standard extraction of reads where the whole sequences were extracted and saved in the FASTA and FASTA quality files, with clipping information given in the XML. Additionally, the clipped parts of every read are shown in lower case in the FASTA file.
After two or three minutes, the directory looked like this:
arcadia:bceti_assembly/data$
ls -l

-rw-r--r-- 1 bach users  91863124 2008-11-08 17:15 bceti.fasta
-rw-r--r-- 1 bach users 264238484 2008-11-08 17:15 bceti.fasta.qual
-rw-r--r-- 1 bach users  52197816 2008-11-08 17:15 bceti.xml
In the example above, sff_extract discovered an unusual sequence pattern and gave a (stern) warning: almost all the sequences written to the FASTA file start with exactly the same bases.
Let's have a look at the first 30 bases of the first 20 sequences of the FASTA that was created:
arcadia:bceti_assembly/data$
head -40 bceti.fasta | grep -v ">" | cut -c 1-30
tcagTCTCCGTCGCAATCGCCGCCCCCACA
tcagTCTCCGTCGGCGCTGCCCGCCCGATA
tcagTCTCCGTCGTGGAGGATTACTGGGCG
tcagTCTCCGTCGGCTGTCTGGATCATGAT
tcagTCTCCGTCCTCGCGTTCGATGGTGAC
tcagTCTCCGTCCATCTGTCGGGAACGGAT
tcagTCTCCGTCCGAGCTTCCGATGGCACA
tcagTCTCCGTCAGCCTTTAATGCCGCCGA
tcagTCTCCGTCCTCGAAACCAAGAGCGTG
tcagTCTCCGTCGCAGGCGTTGGCGCGGCG
tcagTCTCCGTCTCAAACAAAGGATTAGAG
tcagTCTCCGTCCTCACCCTGACGGTCGGC
tcagTCTCCGTCTTGTGCGGTTCGATCCGG
tcagTCTCCGTCTGCGGACGGGTATCGCGG
tcagTCTCCGTCTCGTTATGCGCTCGCCAG
tcagTCTCCGTCTCGCATTTTCCAACGCAA
tcagTCTCCGTCCGCTCATATCCTTGTTGA
tcagTCTCCGTCCTGTGCTGGGAAAGCGAA
tcagTCTCCGTCTCGAGCCGGGACAGGCGA
tcagTCTCCGTCGTCGTATCGGGTACGAAC
What you see is the following: the leftmost 4 characters tcag of every read are the last bases of the standard 454 sequencing adaptor A. The fact that they are given in lower case means that they are clipped away in the SFF (which is good).
However, if you look closely, you will see that there is something peculiar: after the adaptor sequence, all reads seem to start with exactly the same sequence TCTCCGTC. This is *not* sane.
This means that the left clip of the reads in the SFF has not been set correctly. The reason for this is probably a wrong value which was used in the 454 data processing pipeline. This seems to be a problem especially when custom sequencing adaptors are used.
In this case, the result is pretty catastrophic: out of the 311201 reads in the SFF, 307639 (98.85%) show this behaviour. We will certainly need to get rid of these first 12 bases.
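By the way, you can quantify such a skew yourself with nothing but standard Unix tools. A minimal sketch on a three-read toy FASTA (real data would of course have hundreds of thousands of reads; run the same pipe on your extracted FASTA):

```shell
# Toy FASTA mimicking the problem: most reads share the same start.
printf '>r1\ntcagTCTCCGTCAAA\n>r2\ntcagTCTCCGTCGGG\n>r3\ntcagAAAATTTTCCC\n' > skew_demo.fasta

# Most frequent 8-base prefix after the 4-base key (columns 5-12).
# A heavily skewed count here is exactly the red flag sff_extract warns about.
grep -v '^>' skew_demo.fasta | cut -c 5-12 | sort | uniq -c | sort -rn | head -1
```

On sane data, the top prefix should account for only a tiny fraction of the reads.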
Now, in cases like these, there are three steps that you really should follow:

1. Is this something that you expect from the experimental setup? If yes, then all is OK and you don't need to take further action. But I suppose that for 99% of all people, these abnormal sequences are not expected.
2. Contact. Your. Sequence. Provider! The underlying problem is something that *MUST* be resolved on their side, not on yours. It might be a simple human mistake, but it might very well be a symptom of a deeper problem in their quality assurance. Notify. Them. Now!
3. In the meantime (or if the sequencing provider does not react), you can use the [--min_left_clip] command line option of sff_extract as suggested in the warning message.
So, to correct for this error, we will redo the extraction of the sequence from the SFF, this time telling sff_extract to set the left clip starting at base 13 at the lowest:
arcadia:bceti_assembly/data$
sff_extract -o bceti --min_left_clip=13 ../origdata/SRR005481.sff
Working on '../origdata/SRR005481.sff':
Converting '../origdata/SRR005481.sff' ... done.
Converted 311201 reads into 311201 sequences.

arcadia:bceti_assembly/data$
ls -l
-rw-r--r-- 1 bach users  91863124 2008-11-08 17:31 bceti.fasta
-rw-r--r-- 1 bach users 264238484 2008-11-08 17:31 bceti.fasta.qual
-rw-r--r-- 1 bach users  52509017 2008-11-08 17:31 bceti.xml
This concludes the small intermezzo on how to deal with wrong left clips.
Let's move on to the paired-end data. While I would recommend that, when working on your own data, you do some kind of data checking, I'll spare you that step for this walkthrough; just believe me that I did it and found nothing really suspicious.
The paired-end protocol of 454 will generate reads which contain the forward and reverse direction in one read, separated by a linker. You have to know the linker sequence! Ask your sequencing provider to give it to you. If standard protocols were used, then the linker sequence for GS20 and FLX will be
>flxlinker GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC
while for Titanium data, you need to use two linker sequences
>titlinker1 TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG >titlinker2 CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
In this case, the center apparently used the standard unmodified 454 FLX linker. Put that linker sequence into a FASTA file and copy it to wherever you like ... in this walkthrough I put it into the origdata directory (not the data directory where we currently are).
arcadia:bceti_assembly/data$
cp /from/whereever/your/file/is/linker.fasta ../origdata
arcadia:bceti_assembly/data$
ls -l ../origdata
-rw-r--r-- 1 bach users        53 2008-11-08 17:32 linker.fasta
-rw-r--r-- 1 bach users 544623256 2008-11-08 16:49 SRR005481.sff
-rw-r--r-- 1 bach users 476632488 2008-11-08 16:55 SRR005482.sff

arcadia:bceti_assembly/data$
cat ../origdata/linker.fasta
>flxlinker GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC
There's one thing that must be found out yet: what was the size of the paired-end library which was constructed, and what is the estimated standard deviation of the sizes? Normally, you will get this information from your sequence provider (if you didn't decide it for yourself). As we're working from a data set deposited at the NCBI, this information should also be available in the accompanying documentation there. But it isn't.
For this walkthrough, we'll simply take a library size of 3000 and an estimated standard deviation of 900.
Now let's extract the paired-end sequences; this may take eight to ten minutes.
arcadia:bceti_assembly/data$
sff_extract -o bceti -a -l ../origdata/linker.fasta -i "insert_size:3000,insert_stdev:900" ../origdata/SRR005482.sff
Testing whether SSAHA2 is installed and can be launched ... ok.
Working on '../origdata/SRR005482.sff':
Creating temporary file from sequences in '../origdata/SRR005482.sff' ... done.
Searching linker sequences with SSAHA2 (this may take a while) ... ok.
Parsing SSAHA2 result file ... done.
Converting '../origdata/SRR005482.sff' ... done.
Converted 268084 reads into 415327 sequences.
The above text tells you that the conversion process saw 268084 reads in the SFF. Searching for the paired-end linker and removing it, 415327 sequences were created. Obviously, some sequences had either no linker, or the linker was so close to the edge of the read that the 'split' resulted in just one sequence.
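Conceptually, the split works as sketched below. This is an illustration only: sff_extract does the real work (it also handles quality values, clips and the XML), and the read and linker placement here are invented:

```shell
# The standard FLX linker from the walkthrough, embedded in a made-up read.
LINKER=GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC
READ=AAACCCGGG${LINKER}TTTAAACCC

# Cut the read at the linker position: one paired-end read yields two
# sequences, the parts before and after the linker.
echo "$READ" | awk -v l="$LINKER" '{
    i = index($0, l)
    print substr($0, 1, i-1)          # forward part, before the linker
    print substr($0, i + length(l))   # reverse part, after the linker
}'
```

A read where the linker sits at the very edge would yield an empty half, which is why some reads produce only one sequence.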
The directory will now look like this:
arcadia:bceti_assembly/data$
ls -l
-rw-r--r-- 1 bach users 170346423 2008-11-08 17:55 bceti.fasta
-rw-r--r-- 1 bach users 483048864 2008-11-08 17:55 bceti.fasta.qual
-rw-r--r-- 1 bach users 165413112 2008-11-08 17:55 bceti.xml
We're almost done. As a last step, we will rename the files into a scheme that suits MIRA (we could have used the -s, -q and -x options of sff_extract directly, but I wanted to keep the example straightforward).
arcadia:bceti_assembly/data$
mv bceti.fasta bceti_in.454.fasta
arcadia:bceti_assembly/data$
mv bceti.fasta.qual bceti_in.454.fasta.qual
arcadia:bceti_assembly/data$
mv bceti.xml bceti_traceinfo_in.454.xml
arcadia:bceti_assembly/data$
ls -l
-rw-r--r-- 1 bach users 170346423 2008-11-08 17:55 bceti_in.454.fasta
-rw-r--r-- 1 bach users 483048864 2008-11-08 17:55 bceti_in.454.fasta.qual
-rw-r--r-- 1 bach users 165413112 2008-11-08 17:55 bceti_traceinfo_in.454.xml
That's it.
Preparing an assembly is now just a matter of setting up a directory and linking the input files into that directory.
arcadia:bceti_assembly/data$
cd ../assemblies/
arcadia:bceti_assembly/assemblies$
mkdir arun_08112008
arcadia:bceti_assembly/assemblies$
cd arun_08112008
arcadia:assemblies/arun_08112008$
ln -s ../../data/* .
arcadia:bceti_assembly/assemblies/arun_08112008$
ls -l
lrwxrwxrwx 1 bach users 29 2008-11-08 18:17 bceti_in.454.fasta -> ../../data/bceti_in.454.fasta
lrwxrwxrwx 1 bach users 34 2008-11-08 18:17 bceti_in.454.fasta.qual -> ../../data/bceti_in.454.fasta.qual
lrwxrwxrwx 1 bach users 33 2008-11-08 18:17 bceti_traceinfo_in.454.xml -> ../../data/bceti_traceinfo_in.454.xml
Note: Please consult the corresponding section in the mira_usage document; it contains much more information than this stub.
But basically, after the assembly has finished, you will find four directories. The tmp directory can be deleted without remorse, as it contains logs and a tremendous amount of temporary data (dozens of gigabytes for bigger projects). The info directory has some text files with basic statistics and other informative files. Start by having a look at *_info_assembly.txt; it'll give you a first idea of how the assembly went.
The results directory finally contains the assembly files in different formats, ready to be used for further processing with other tools.
If you used the uniform read distribution option, you will inevitably need to filter your results, as this option produces larger and better alignments, but also more "debris contigs". For this, use the convert_project program which is distributed together with the MIRA package.
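If you'd rather eyeball things first, a generic length filter over a FASTA shows which contigs a size cutoff would keep. A minimal awk sketch on a toy file (the file name is invented, and the threshold is lowered to 10 bases so the example stays short; use 1000 on real contigs, and use convert_project for the actual filtering):

```shell
# Toy contig file; real input would be the FASTA from the results directory.
printf '>contig1\nACGT\n>contig2\nAAAATTTTGGGGCCCC\n' > contigs_demo.fasta

# Keep only sequences of at least 10 bases.
awk '/^>/ { if (seq != "" && length(seq) >= 10) print hdr ORS seq;
            hdr=$0; seq=""; next }
     { seq = seq $0 }
     END { if (seq != "" && length(seq) >= 10) print hdr ORS seq }' contigs_demo.fasta
```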
Also very important when analysing 454 assemblies: screen the small contigs (< 1000 bases) for abnormal behaviour. You wouldn't be the first to have some human DNA contamination in a bacterial sequencing project. Or some herpes virus sequence in a bacterial project. Or some bacterial DNA in a human data set. Look at:

- whether these small contigs have a different GC content than the large contigs
- whether a BLAST of these sequences against some selected databases brings up hits in organisms that you certainly were not sequencing.
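The GC check is quick to do yourself. A sketch that prints the GC percentage per sequence with awk; the input file here is a made-up two-sequence toy, in practice you would run it on the small contigs you want to screen:

```shell
# Toy FASTA; grossly different GC values between small and large contigs
# are worth a closer look.
printf '>small1\nGGGGCCCCGG\n>small2\nATATATATAT\n' > gc_demo.fasta

awk '/^>/ { if (name) printf "%s %.1f%%\n", name, 100*gc/len;
            name=substr($0,2); gc=0; len=0; next }
     { len += length($0); gc += gsub(/[GCgc]/, "") }
     END { if (name) printf "%s %.1f%%\n", name, 100*gc/len }' gc_demo.fasta
```

Bacterial genomes typically sit in a fairly narrow GC band, so outliers among the small contigs are good BLAST candidates.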