Assembly of Ion Torrent data with MIRA3

Bastien Chevreux

MIRA Version 3.4.1.1

Document revision $Id$

Table of Contents

1. Introduction
1.1. Some reading requirements
2. Characteristics of Ion Torrent data
2.1. Homopolymer insertions / deletions
2.2. Sequencing direction dependent insertions / deletions
2.3. Coverage variance
2.4. GC bias
2.5. Other sources of error
2.6. Where to find further information
3. An Ion Torrent assembly walkthrough
3.1. Preparing your file system
3.2. Getting the data for this walkthrough
3.3. Preparing the Ion Torrent data for MIRA
3.4. Starting the assembly
4. What to do with the MIRA result files?
 

A baby is Life's way of insisting that the universe give it another chance.

 
 --Solomon Short (modified)

1.  Introduction

MIRA can assemble Ion Torrent type data either on its own or together with Sanger, 454 or Solexa type sequencing data (true hybrid assembly). Paired-end sequences coming from genomic projects can also be used if you take care to prepare your data the way MIRA needs it.

MIRA goes a long way to assemble sequence in the best possible way: it uses multiple passes, learning in each pass from errors that occurred in the previous passes. There are routines specialised in handling oddities that occur in different sequencing technologies.

[Warning]Warning

Ion Torrent is pretty new and I did not have as much data to analyse as I had with Sanger, 454 or Solexa. MIRA has been configured to automatically work well with data currently available on the market and with data which is to be expected during the course of 2011 / 2012.

However, IonTorrent is - at the moment - a moving target: there are new protocols every few months and it might be that you need to fetch the latest MIRA version available to get the best possible assembly.

1.1.  Some reading requirements

This guide assumes that you have basic working knowledge of Unix systems, know the basic principles of sequencing (and sequence assembly) and what assemblers do.

While there are step by step walkthroughs on how to setup your Ion Torrent data and then perform an assembly, this guide expects you to read at some point in time

  • the mira_usage introductory help file so that you have a basic knowledge on how to set up projects in mira for Sanger sequencing projects.

  • and last but not least the mira_reference help file to look up some command line options.

2.  Characteristics of Ion Torrent data

What I can say at the moment is that Ion Torrent reads behave very much like the early data from the 454 technology (454 GS20): reads are mostly between 90 and 110 bases long, with Ion Torrent having a showcase with reads of ~220 to 230 bases. The following figure shows what you can get out of 100bp reads if you're lucky:

Figure 1.  Example for good IonTorrent data (100bp reads). Note that only a single sequencing error - shown by blue background - can be seen. Apart from this, all homopolymers of size 3 and 4 in the area shown are good.

The "if you're lucky" part in the preceding sentence is not there by accident: having so many clean reads is more the exception than the rule. On the other hand, most sequencing errors in current IonTorrent data are unproblematic ... if it were not for indels, which are explained in the next sections.

2.1.  Homopolymer insertions / deletions

The main source of error in your data will be insertions / deletions (indels), especially in homopolymer regions (but not only there, see also next section). Starting with a base run of 4 to 6 bases, there is a distinct tendency to have an increased occurrence of indel errors.

Figure 2.  Example for problematic IonTorrent data (100bp reads).

The above figure contains a couple of particularly nasty indel problems. While areas 2 (C-homopolymer length 3), 5 (A-homopolymer length 4) and 6 (T-homopolymer length 3) are not a big problem as most of the reads got the length right, the areas 1, 3 and 4 are nasty.

Area 1 is an A-homopolymer of length 7 and while many reads get that length right (enough to tell MIRA what the true length is), it also contains reads with a length of 6 and others with a length of 8.

Area 3 is an "A-homopolymer" of length 2 where approximately half of the reads get the length right, the other half not. See also the following section.

Area 4 is a T-homopolymer of length 5 which also has approximately half the reads with a wrong length of 4.
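
To get a feeling for how many potentially troublesome homopolymer runs a stretch of sequence contains, a quick scan can help. The following is only a minimal sketch and not part of MIRA: the function name homopolymer_runs is made up for illustration, and the default minimum run length of 4 is an assumption taken from the observation above that problems start at runs of 4 to 6 bases.

```shell
# List all homopolymer runs of at least MINLEN bases (default 4) found in a
# plain sequence string. Hypothetical helper for illustration only.
homopolymer_runs() {
    seq=$1
    minlen=${2:-4}
    printf '%s\n' "$seq" |
        tr 'acgt' 'ACGT' |
        grep -oE "A{$minlen,}|C{$minlen,}|G{$minlen,}|T{$minlen,}"
}

# Prints TTTTT and AAAAAAA, one per line; the CCC run is below the threshold.
homopolymer_runs GATTTTTCAAAAAAAGCCCTG 4
```

This only tells you where indel errors are more likely; as the next section shows, Ion Torrent indels are not restricted to such runs.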

2.2.  Sequencing direction dependent insertions / deletions

In the previous section, the screenshot showing indels had an indel at a homopolymer of 2, which is something quite curious. Upon closer investigation, one might notice a pattern in the gap/nogap distribution: it follows almost exactly the direction in which the reads were built!

I looked for other examples of this behaviour and found quite a number of them, the following figure shows a very clear case of that error behaviour:

Figure 3.  Example for a sequencing direction dependent indel. Note how all but one of the reads in '+' direction miss a base while all reads built in '-' direction have the correct number of bases.

This is quite astonishing: the problem occurs at a site without real homopolymer (calling a 2-bases run a 'homopolymer' starts stretching the definition a bit) and there are no major problematic homopolymer sites near. In fact, this was more or less the case for all sites I had a look at.

Neither did the cases which were investigated show common base patterns, so unlike the Solexa GGCxG motif, this IonTorrent error does not seem to be bound to a particular motif.

While I cannot prove the following statement, I somehow suspect that there must be some kind of secondary structure forming which leads to that kind of sequencing error. If anyone has a good explanation I'd be happy to hear it: feel free to contact me.

2.3.  Coverage variance

The coverage variance with the current ~100bp reads is a bit on the bad side for low coverage projects (10x to 15x): it varies wildly, sometimes dropping to nearly zero, sometimes reaching approximately double the coverage.

While showing the same up and down, the effect on an assembly will be less pronounced with higher coverages (25x and more) as the chance increases that some reads are sequenced that span a gap. The following two figures show typical coverage plots for the E. coli data (100bp reads, ~33x coverage) published by Ion Torrent.

Figure 4.  IonTorrent coverage (1) 320kb contig

Figure 5.  IonTorrent coverage (1) zoom of a 12kb stretch

From these figures (and some other data I have) I expect that one would need a coverage of

≤1x

for rough bug identification, i.e. answering what it is.

~ 5x

for rough pathway exploration, i.e., answering the question which pathways are more or less present (even if one misses a gene or two in different pathways).

~ 12x to 15x

for gene fishing expeditions, i.e., get enough sequence to have almost all genes of an organism somehow present, even if some are fragmented into different contigs or contain sequencing errors.

≥ 25x

for assemblies which are not too bad.

≥ 40x

for assemblies which represent the best possible thing you can get with IonTorrent nowadays.
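
To see roughly which of these coverage classes a data set falls into, one can divide the total number of sequenced bases by the expected genome size. The following is a minimal sketch under stated assumptions: the function name estimate_coverage is hypothetical, the FASTQ file is assumed to use exactly four lines per read, and the genome size has to be supplied by you. Note that it counts all bases, including parts that will later be clipped, so the usable coverage will be somewhat lower.

```shell
# Estimate the average coverage: total bases in a FASTQ file divided by the
# genome size (in bases). Hypothetical helper for illustration only.
# Assumes 4 lines per FASTQ record (sequences not wrapped over several lines).
estimate_coverage() {
    fastq=$1
    genome_size=$2
    awk -v g="$genome_size" '
        NR % 4 == 2 { bases += length($0) }   # 2nd line of each record = bases
        END         { printf "%.1f\n", bases / g }
    ' "$fastq"
}
```

A call would look like: estimate_coverage reads.fastq 4700000, where reads.fastq and the genome size of 4.7 million bases are placeholder values you need to replace with your own.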

2.4.  GC bias

The GC bias seems to be small to non-existent, at least I could not immediately make a correlation between GC content and coverage. However, the only data sets I've seen so far are for E. coli which has a GC content of roughly 50% ... I'd like to check GC bias with a couple of other organisms before giving a final statement.

2.5.  Other sources of error

You will want to keep an eye on the clipping of the data in the SFF files from IonTorrent: while it is generally good enough, some data sets of IonTorrent show that - for some error patterns - the clipping is too lax and strange artefacts appear. MIRA will take care of these - or at least of those it knows - but you should be aware of this potential problem.

2.6.  Where to find further information

IonTorrent being pretty new, getting as much information as possible on that technology is quite important. So here are a couple of links I found to be helpful:

  • There is, of course, the TorrentDev site (http://lifetech-it.hosted.jivesoftware.com/community/torrent_dev) at Life Technologies which will be helpful to get a couple of questions answered.

    Just be aware that some of the documents over there are sometimes painting an - how should I say it diplomatically? - overly optimistic view on the performance of the technology. On the other hand, so do documents released by the main competitors like 454/Roche, Illumina, PacBio etc. ... so no harm done there.

  • I found Nick Loman's blog Pathogens: Genes and Genomes to be my currently most valuable source of information on IonTorrent. While the group he works for won a sequencer from IonTorrent, he makes that fact very clear and still unsparingly dissects the data he gets from that machine.

    His posts got me going in getting MIRA to grok IonTorrent.

  • The blog of Lex Nederbragt In between lines of code is playing in the same league: very down to earth and he knows a bluff when he sees it ... and is not afraid to call it (be it from IonTorrent, PacBio or 454).

    The analyses he did on a couple of Ion data sets have saved me quite some time.

  • Last, but not least, the board with IonTorrent-related-stuff over at SeqAnswers, the first and foremost one-stop-shop ... erm ... discussion board for everything related to sequencing nowadays.

3.  An Ion Torrent assembly walkthrough

This walkthrough will use two data sets for E. coli strain DH10B made available by IonTorrent and will show you the main steps you need to perform to get assemblies going.

3.1.  Preparing your file system

Note: this is how I set up a project, feel free to implement whatever structure suits your needs.

arcadia:$ mkdir dh10b
arcadia:$ cd dh10b
arcadia:dh10b$ mkdir origdata data assemblies

Your directory should now look like this:

arcadia:dh10b$ ls -l
drwxr-xr-x 2 bach users 48 2011-08-12 22:43 assemblies
drwxr-xr-x 2 bach users 48 2011-08-12 22:43 data
drwxr-xr-x 2 bach users 48 2011-08-12 22:43 origdata

Explanation of the structure:

  • the origdata directory will contain the 'raw' result files that one might get from sequencing. In our case it will be the ZIP files from the IonTorrent site.

  • the data directory will contain the preprocessed sequences for the assembly, ready to be used by MIRA

  • the assemblies directory will contain assemblies we make with our data (we might want to make more than one).

3.2.  Getting the data for this walkthrough

The data sets in question are

  1. E. coli DH10B, PGM run B13-328 which you can download from http://lifetech-it.hosted.jivesoftware.com/docs/DOC-1651 (download the SFF). This data set, subsequently nicknamed B13, contains data from the 316 chip with reads of an average size of ~100bp.

  2. E. coli DH10B, PGM run B14-387 which you can download from http://lifetech-it.hosted.jivesoftware.com/docs/DOC-1848. This data set, subsequently nicknamed B14, contains data from the 314 chip which IonTorrent uses to show off the "longer reads" capability of its sequencer. "Longer" meaning in this case an average of ~220 bp, which is not bad at all.

Save the two ZIP files into the origdata directory, which should now look like this:

arcadia:dh10b$ ls -l origdata
-rw-r--r-- 1 bach bach 824002890 2011-08-15 21:43 B13_328.sff.zip
-rw-r--r-- 1 bach bach 327926296 2011-08-14 20:32 B14_387_CR_0.05.sff.zip

Our data is still in ZIP files, let's get them out and put them into the data directory:

arcadia:dh10b$ cd data
arcadia:data$ unzip ../origdata/B13_328.sff.zip
Archive:  ../origdata/B13_328.sff.zip
  inflating: B13_328.sff             
   creating: __MACOSX/
  inflating: __MACOSX/._B13_328.sff  
arcadia:data$ unzip ../origdata/B14_387_CR_0.05.sff.zip
Archive:  ../origdata/B14_387_CR_0.05.sff.zip
  inflating: R_2011_07_19_20_05_38_user_B14-387-r121336-314_pool30-ms_B14-387_cafie_0.05.sff
arcadia:data$ ls -l
-rw-r--r-- 1 bach bach 1721658336 2011-06-17 01:29 B13_328.sff
drwxrwxr-x 2 bach bach       4096 2011-06-21 16:16 __MACOSX
-rw-rw-r-- 1 bach bach  688207032 2011-07-28 23:31 R_2011_07_19_20_05_38_user_B14-387-r121336-314_pool30-ms_B14-387_cafie_0.05.sff

Oooops, quite some chaos ... IonTorrent included some unnecessary things (the __MACOSX directory) and gave their data files wildly different names. Let's clean up a bit here:

arcadia:data$ rm -rf __MACOSX
arcadia:data$ mv B13_328.sff B13.sff
arcadia:data$ mv R_2011_07_19_20_05_38_user_B14-387-r121336-314_pool30-ms_B14-387_cafie_0.05.sff B14.sff
arcadia:data$ ls -l
-rw-r--r-- 1 bach bach 1721658336 2011-06-17 01:29 B13.sff
-rw-rw-r-- 1 bach bach  688207032 2011-07-28 23:31 B14.sff

There, much nicer.

3.3.  Preparing the Ion Torrent data for MIRA

MIRA will need the base sequences, quality values attached to those bases and - if already present - clipping points for quality clips and sequencing adaptor clips.

The basic data type you will get from the sequencing instruments will be SFF files. Those files contain almost all information needed for an assembly, but SFFs need to be converted into more standard files before MIRA can use this information.

In former times this was done using 3 files (FASTA, FASTA quality and XML), but nowadays the FASTQ format is used almost everywhere, so we will need only two files: FASTQ for sequence + quality and XML for clipping information.

[Note]Tip
Use the sff_extract script from Jose Blanca at the University of Valencia. The home of sff_extract is: http://bioinf.comav.upv.es/sff_extract/index.html but I am thankful to Jose for giving permission to distribute the script in the MIRA 3rd party package (separate download on SourceForge).

The data sets B13 and B14 have short and long IonTorrent reads and we will want to assemble them together, so let's put them together into the input files MIRA needs. For the sake of clarity, I want to name that assembly project dh10b_b13b14.

arcadia:data$ sff_extract \
  -Q \
  -s dh10b_b13b14_in.iontor.fastq \
  -x dh10b_b13b14_traceinfo_in.iontor.xml \
  B13.sff B14.sff
Working on 'B13.sff':
Converting 'B13.sff' ...  done.
Converted 1687490 reads into 1687490 sequences.
Working on 'B14.sff':
Converting 'B14.sff' ...  done.
Converted 350109 reads into 350109 sequences.

[Note]Note
The above command has been split over multiple lines for better overview; the trailing backslashes make the shell read it as a single command. You can also enter it on one line.

The parameters to sff_extract tell it to extract to FASTQ (via -Q), give the FASTQ file a name we chose (via -s), give the XML file with clipping information a name we chose (via -x) and convert the SFFs named B13.sff and B14.sff.

[Warning]Warning

People "in the know" might want to get rid of the XML TRACEINFO file and tell sff_extract to simply dump hard-clip sequences into the FASTQ file via the [-c] argument. Hard-clipped means: the clipped sequence parts of a read are physically trimmed away, never to be seen again.

This is D I S C O U R A G E D !

Reason: unlike 454, IonTorrent actively uses the SFF feature to set different clipping points for quality and adaptor clips. This is useful information for MIRA. Furthermore, some of the quality control algorithms of MIRA also use the clipped part of a read to improve assembly quality with measurable effect. If a hard clip was performed on the sequences, these algorithms are no longer as effective.

For the die hards out there who really do not want the XML TRACEINFO files: if MIRA gets only the sequence, it will use the usual 454/Roche convention to treat left and right lower case part of sequences as clipped and retain the uppercase middle part of a sequence. sff_extract adheres to this convention, and while the resulting assemblies are not quite as good as with the TRACEINFO XML, they're still better than with hard clipped sequences.
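
The lowercase / uppercase convention described above can be illustrated with a small sketch. It is not how MIRA or sff_extract handle clips internally; the function name unclipped_part and the example read are made up, and the sketch simply strips the lowercase flanks to show which part of a read would be retained.

```shell
# Show the part of a read that remains when the lowercase flanks (the clipped
# parts in the convention described above) are removed.
# Hypothetical helper for illustration only.
unclipped_part() {
    printf '%s\n' "$1" | sed -e 's/^[acgtn]*//' -e 's/[acgtn]*$//'
}

# Prints ACGTTGCAAGGT
unclipped_part tcagACGTTGCAAGGTccgta
```

The point of soft clipping is that the lowercase flanks stay in the file, so the information is still there for algorithms that want to look at it.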

The conversion can take some time, the ~2 million IonTorrent reads from this example need approximately 2.5 minutes for conversion. Go grab a coffee, or tea, or whatever.

Welcome back. Your directory should now look something like this:

arcadia:data$ ls -l
-rw-r--r-- 1 bach bach  771462745 2011-08-19 22:24 dh10b_b13b14_in.iontor.fastq
-rw-r--r-- 1 bach bach  441331004 2011-08-19 22:24 dh10b_b13b14_traceinfo_in.iontor.xml
-rw-r--r-- 1 bach bach 1721658336 2011-06-17 01:29 B13.sff
-rw-rw-r-- 1 bach bach  688207032 2011-07-28 23:31 B14.sff
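
While the SFF files are still around, a quick consistency check of the two new files does not hurt. The following is a minimal sketch, not a feature of sff_extract or MIRA: the function name check_read_counts is hypothetical, and it assumes a FASTQ file with four lines per read and a TRACEINFO XML file with one <trace_name> element per read, each on its own line.

```shell
# Compare the number of reads in a FASTQ file with the number of
# <trace_name> entries in the accompanying TRACEINFO XML file.
# Assumptions: 4 lines per FASTQ record, one <trace_name> per line in the XML.
check_read_counts() {
    fastq=$1
    xml=$2
    n_fastq=$(awk 'END { print NR / 4 }' "$fastq")
    n_xml=$(grep -c '<trace_name>' "$xml")
    echo "FASTQ reads: $n_fastq, XML entries: $n_xml"
    [ "$n_fastq" = "$n_xml" ]      # exit status 0 only if both counts agree
}
```

For the walkthrough data, the call check_read_counts dh10b_b13b14_in.iontor.fastq dh10b_b13b14_traceinfo_in.iontor.xml should report 2037599 for both numbers (1687490 + 350109 reads) if the assumptions above hold.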

Cool. Last step: we do not need the SFF files anymore, let's get rid of them:

arcadia:data$ rm *sff
arcadia:data$ ls -l
-rw-r--r-- 1 bach bach  771462745 2011-08-19 22:24 dh10b_b13b14_in.iontor.fastq
-rw-r--r-- 1 bach bach  441331004 2011-08-19 22:24 dh10b_b13b14_traceinfo_in.iontor.xml

3.4.  Starting the assembly

Good, we're almost there. Let's switch to the assembly directory and create a subdirectory for our first assembly test.

arcadia:data$ cd ../assemblies/
arcadia:assemblies$ mkdir 1sttest
arcadia:assemblies$ cd 1sttest

This directory is quite empty and the IonTorrent data is not present. Let's put some links to the files created in the previous step:

arcadia:1sttest$ ln -s ../../data/* .
arcadia:1sttest$ ls -l
lrwxrwxrwx 1 bach bach 42 2011-08-19 22:44 dh10b_b13b14_in.iontor.fastq -> ../../data/dh10b_b13b14_in.iontor.fastq
lrwxrwxrwx 1 bach bach 40 2011-08-19 22:44 dh10b_b13b14_traceinfo_in.iontor.xml -> ../../data/dh10b_b13b14_traceinfo_in.iontor.xml
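
Since the input files are found via their names, a typo in a file name or a dangling link is an easy way to lose time. The following is a minimal sketch of a check, assuming the naming scheme used in this walkthrough (project name followed by _in.iontor.fastq and _traceinfo_in.iontor.xml); the function name check_mira_inputs is hypothetical and not part of MIRA.

```shell
# Check that both input files for a project exist in the current directory,
# using the file naming scheme of this walkthrough.
# Hypothetical helper for illustration only.
check_mira_inputs() {
    project=$1
    status=0
    for f in "${project}_in.iontor.fastq" "${project}_traceinfo_in.iontor.xml"; do
        if [ -e "$f" ]; then       # -e follows links, so dangling links fail
            echo "found:   $f"
        else
            echo "missing: $f"
            status=1
        fi
    done
    return $status
}
```

In the 1sttest directory, check_mira_inputs dh10b_b13b14 should then list both files as found.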

Starting the assembly is now just a matter of one line with some parameters set correctly:

arcadia:1sttest$ mira \
  --project=dh10b_b13b14 \
  --job=denovo,genome,accurate,iontor \
  >&log_assembly.txt

[Note]Note
The above command has been split over multiple lines for better overview; the trailing backslashes make the shell read it as a single command. You can also enter it on one line.

Now, that was easy, wasn't it? For assemblies having only Ion Torrent data, and if you followed the walkthrough on how to prepare the data, the only options in the above example you might want to adapt at first are the following:

  • --project (for naming your assembly project)

  • --job (perhaps to change the quality of the assembly to 'draft')

Of course, you are free to change any option via the extended parameters, perhaps change the default number of processors to use from 2 to 4 via [-GE:not=4] or any other of the > 150 parameters MIRA has ... but this is covered in the MIRA main reference manual.
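
If you start several assemblies with different settings, it can help to build the command line in one place. The following is a minimal sketch that only prints a command line and does not run MIRA; the function name mira_cmdline is hypothetical, and the only options it uses are the ones mentioned above (--project, --job and -GE:not).

```shell
# Print (not run) a mira command line for a de-novo genome assembly of
# Ion Torrent data. Hypothetical helper for illustration only.
mira_cmdline() {
    project=$1
    quality=${2:-accurate}    # 'accurate' or 'draft'
    threads=${3:-2}           # value passed to -GE:not
    echo "mira --project=$project --job=denovo,genome,$quality,iontor -GE:not=$threads"
}

# Prints: mira --project=dh10b_b13b14 --job=denovo,genome,draft,iontor -GE:not=4
mira_cmdline dh10b_b13b14 draft 4
```

Printing the command first makes it easy to check it by eye before committing a machine to a long assembly run.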

4.  What to do with the MIRA result files?

[Note]Note
Please consult the corresponding section in the mira_usage document, it contains much more information than this stub.

But basically, after the assembly has finished, you will find four directories. The tmp directory can be deleted without remorse as it contains logs and a tremendous amount of temporary data (dozens of gigabytes for bigger projects). The info directory has some text files with basic statistics and other informative files. Start by having a look at the *_info_assembly.txt file, it'll give you a first idea of how the assembly went.

The results directory finally contains the assembly files in different formats, ready to be used for further processing with other tools.

If you used the uniform read distribution option, you will inevitably need to filter your results as this option produces larger and better alignments, but also more "debris contigs". For this, use the convert_project program, which is distributed together with the MIRA package.

Also very important when analysing Ion Torrent assemblies: screen the small contigs ( < 1000 bases) for abnormal behaviour. You wouldn't be the first to have some human DNA contamination in a bacterial sequencing. Or some herpes virus sequence in a bacterial project. Or some bacterial DNA in a human data set. Or ...

Check whether these small contigs

  • have a different GC content than the large contigs

  • produce BLAST hits, when searched against some selected databases, in other organisms that you certainly were not sequencing.
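
The GC content part of this screening is easy to do on the command line. The following is a minimal sketch, not a tool from the MIRA package: the function name fasta_len_gc is hypothetical, and it expects the contigs as a FASTA file (which file of the results directory you use is up to you). It prints name, length and GC percentage for every sequence, separated by tabs.

```shell
# Print name, length and GC percentage (tab separated) for every sequence in
# a FASTA file. Hypothetical helper for illustration only.
# Note: N and other non-ACGT characters count towards the length.
fasta_len_gc() {
    awk '
        function report() {
            if (name != "")
                printf "%s\t%d\t%.1f\n", name, len, (len ? 100 * gc / len : 0)
        }
        /^>/ { report(); name = substr($1, 2); len = 0; gc = 0; next }
        {
            line = toupper($0)
            len += length(line)
            gc  += gsub(/[GC]/, "", line)   # gsub returns the number of hits
        }
        END { report() }
    ' "$1"
}
```

Piping the output through awk '$2 < 1000' restricts the list to the small contigs, whose GC values can then be compared with those of the large ones. Contigs that stand out are good candidates for a BLAST search.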