Assembly of hard genome or EST / RNASeq projects

Bastien Chevreux

MIRA Version 3.4.1.1

Document revision $Id$

Table of Contents

1. Getting 'mean' genomes or EST / RNASeq data sets assembled
1.1. For the impatient
1.2. Introduction to 'masking'
1.3. How does "nasty repeat" masking work?
1.4. Selecting a "nasty repeat ratio"
2. How MIRA tags different repeat levels
3. The readrepeats info file
4. Pipeline to find worst contaminants or repeats in sequencing data
5. Examples for hash statistics
5.1. Caveat: -SK:bph
5.2. Sanger sequencing, a simple bacterium
5.3. 454 Sequencing, a somewhat more complex bacterium
5.4. Solexa sequencing, E.coli MG1655
5.5. (NEED EXAMPLES FOR EUKARYOTES)
5.6. (NEED EXAMPLES FOR PATHOLOGICAL CASES)
 

If it were easy, it would have been done already.

 
 --Solomon Short

1.  Getting 'mean' genomes or EST / RNASeq data sets assembled

For some genome or EST data sets you might want to assemble, MIRA will take too long or the available memory will not be sufficient. For genomes this can be the case for eukaryotes and plants, but also for some bacteria which contain a high number of (pro-)phages, plasmids or engineered operons. For EST data sets, this concerns all projects with non-normalised libraries.

This guide is intended to get you through these problematic genomes. It is not (and cannot be) exhaustive, but it should get you going.

1.1.  For the impatient

Use [-SK:mnr=yes:nrr=10] and give it a try. If that does not work, decrease [-SK:nrr] to a value between 5 and 9. If it worked well enough, increase the [-SK:nrr] parameter to 15 or 20. But please also read on to see how to choose the "nrr" threshold.

1.2.  Introduction to 'masking'

The SKIM phase (all-against-all comparison) will report almost every potential hit to be checked with Smith-Waterman further downstream in the MIRA assembly process. While this is absolutely no problem for most bacteria, some genomes (eukaryotes, plants, some bacteria) have so many closely related sequences (repeats) that the data structures needed to take up all information might get much larger than your available memory. In those cases, your only chance to still get an assembly is to tell the assembler it should disregard extremely repetitive features of your genome.

There is, in most cases, one problem: one doesn't know beforehand which parts of the genome are extremely repetitive. But MIRA can help you here as it produces most of the needed information during assembly and you just need to choose a threshold from where on MIRA won't care about repetitive matches.

The keys to this are the two fail-safe command line parameters which will mask "nasty" repeats from the quick overlap finder (SKIM): [-SK:mnr] and [-SK:nrr=10]. ([-SK:bph] also plays a role in this, but I'll come back to this later.)

1.3.  How does "nasty repeat" masking work?

If switched on via [-SK:mnr=yes], MIRA will use SKIM3 k-mer statistics to find repetitive stretches. K-mers are nucleotide stretches of length k. In a perfectly sequenced genome without any sequencing error and without sequencing bias, the k-mer frequency can be used to assess how many times a given nucleotide stretch is present in the genome: if a specific k-mer is present as many times as the average frequency of all k-mers, it is reasonable to assume that this k-mer is not part of a repeat (at least not in this genome).

Following the same line of thinking, if the frequency of a specific k-mer is two times higher than the average of all k-mers, one would assume that this k-mer is part of a repeat which occurs exactly two times in the genome. At three times the average frequency, the repeat is assumed to be present three times, and so on. MIRA will merge the frequency information of single k-mers into larger 'repeat' stretches and tag these stretches accordingly.

Of course, low-complexity nucleotide stretches (like poly-A in eukaryotes), sequencing errors in reads and non-uniform distribution of reads in a sequencing project will weaken the initial assumption that a k-mer frequency is representative for repeat status. But even then the k-mer frequency model works quite well and will give a pretty good overall picture: most repeats will be tagged as such.
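The frequency model described above can be illustrated with a few lines of Python. This is only a minimal sketch and not how MIRA implements it: the function names count_kmers and estimated_copies are made up for this example, reads are plain strings, and the reverse strand, sequencing errors and coverage bias are all ignored.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimated_copies(counts):
    """Estimate the copy number of each k-mer as frequency / average frequency.

    Assumption of this sketch: the plain arithmetic mean over all distinct
    k-mers stands in for the 'average frequency' mentioned in the text.
    """
    average = sum(counts.values()) / len(counts)
    return {kmer: n / average for kmer, n in counts.items()}
```

A k-mer with an estimated value near 1 would be treated as non-repetitive, a value near 2 as a two-copy repeat, and so on.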

Note that the parts of reads tagged as "nasty repeat" will not get masked per se, the sequence will still be present. The stretches dubbed repetitive will get the "MNRr" tag. They will still be used in Smith-Waterman overlaps and will generate a correct consensus if included in an alignment, but they will not be used as seed.

Some reads will invariably end up being completely repetitive. These will not be assembled into contigs as MIRA will not see overlaps as they'll be completely masked away. These reads will end up as debris. However, note that MIRA is pretty good at discerning 100% matching repeats from repeats which are not 100% matching: if there's a single base with which repeats can be discerned from each other, MIRA will find this base and use the k-mers covering that base to find overlaps.

1.4.  Selecting a "nasty repeat ratio"

The ratio from which on the MIRA SKIM algorithm won't report matches is set via [-SK:nrr]. E.g., using [-SK:nrr=10] will hide all k-mers which occur at a frequency 10 times (or more) higher than the median of all k-mers.
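As a rough illustration of what such a threshold does, the following Python sketch selects all k-mers at or above nrr times the median frequency and merges their positions in a read into stretches. The names nasty_kmers and masked_stretches are hypothetical, the input is a simple dictionary of k-mer counts, and the merging rule (join overlapping or directly adjacent k-mers) is an assumption of this sketch, not MIRA's actual algorithm.

```python
from statistics import median


def nasty_kmers(counts, nrr=10):
    """Return the k-mers whose frequency is at least nrr times the median."""
    med = median(counts.values())
    return {kmer for kmer, n in counts.items() if n >= nrr * med}


def masked_stretches(read, k, nasty):
    """Return (start, end) intervals of the read covered by nasty k-mers.

    Overlapping or directly adjacent k-mers are merged into one stretch;
    'end' is exclusive, as in Python slices.
    """
    intervals = []
    for i in range(len(read) - k + 1):
        if read[i:i + k] in nasty:
            if intervals and i <= intervals[-1][1]:
                intervals[-1] = (intervals[-1][0], i + k)
            else:
                intervals.append((i, i + k))
    return intervals
```

A read whose only stretch spans its whole length would be the "completely repetitive" case described earlier, which ends up as debris.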

The nastiness of a repeat is difficult to judge, but starting with 10 copies in a genome, things can get complicated. At 20 copies, you'll have some troubles for sure.

The default value of 10 for the [-SK:nrr] parameter is a pretty good starting point which can be tried for an assembly before trying to optimise it by studying the hash statistics calculated by MIRA. For the latter, please read the section 'Examples for hash statistics' further down in this guide.

2.  How MIRA tags different repeat levels

During SKIM phase, MIRA will assign frequency information to each and every k-mer in all reads of a sequencing project, giving them different status. Additionally, tags are set in the reads so that one can assess reads in assembly editors that understand tags (like gap4, gap5, consed etc.). The following tags are used:

HAF2

coverage below average (default: < 0.5 times average)

HAF3

coverage is at average (default: ≥ 0.5 times average and ≤ 1.5 times average)

HAF4

coverage above average (default: > 1.5 times average and < 2 times average)

HAF5

probably repeat (default: ≥ 2 times average and < 5 times average)

HAF6

'crazy' repeat (default: > 5 times average)

MNRr

stretches which were masked away by [-SK:mnr=yes] because they are more than [-SK:nrr=...] times repetitive.
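The default boundaries listed above can be written down as a small Python function. This is just a sketch for reading the table, with a made-up function name haf_tag. One assumption needs to be stated clearly: the list leaves the case of exactly 5 times average open, and this sketch assigns it to HAF6.

```python
def haf_tag(frequency, average):
    """Map a k-mer frequency to a HAF tag using the default ratios listed above."""
    ratio = frequency / average
    if ratio < 0.5:
        return "HAF2"  # coverage below average
    if ratio <= 1.5:
        return "HAF3"  # coverage at average
    if ratio < 2:
        return "HAF4"  # coverage above average
    if ratio < 5:
        return "HAF5"  # probably repeat
    # Assumption: a ratio of exactly 5 is also treated as HAF6 here.
    return "HAF6"      # 'crazy' repeat
```

For example, with an average of 15, a k-mer seen 45 times has a ratio of 3 and would fall into HAF5.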

3.  The readrepeats info file

If [-SK:mnr=yes] is used, MIRA will write an additional file into the info directory: <projectname>_info_readrepeats.lst

The "readrepeats" file makes it possible to try and find out what makes sequencing data nasty. It's a key-value-value file with the name of the sequence as "key" and then the type of repeat (HAF2 - HAF7 and MNRr) and the repeat sequence as "values". "Nasty" in this case means everything which was masked via [-SK:mnr=yes].

The file looks like this:

read1     HAF5   GCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCT ...
read2     HAF7   CCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGC ...
read2     MNRr   AAAAAAAAAAAAAAAAAAAAAAAAAAAA ...
read3     HAF6   GCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCT ...
...
etc.
    

That is, each line consists of the read name where a stretch of repetitive sequences was found, then the MIRA repeat categorisation level (HAF2 to HAF7 and MNRr) and then the stretch of bases which is seen to be repetitive.

Note that a single read can contain several disjoint repeat stretches, hence a read can occur more than once in the file, as shown with read2 in the example above.
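Reading this file into a program is straightforward. The following Python sketch groups the repeat stretches by read name. The function name parse_readrepeats is made up for this example, and the sketch assumes that the three columns are separated by whitespace (tabs or spaces) and that the sequence itself contains no whitespace.

```python
from collections import defaultdict


def parse_readrepeats(lines):
    """Group (repeat level, sequence) pairs by read name.

    Lines with fewer than three whitespace-separated fields are skipped.
    """
    repeats = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        name, level, sequence = fields[0], fields[1], fields[2]
        repeats[name].append((level, sequence))
    return repeats
```

The function accepts any iterable of lines, so an open file handle for the readrepeats file can be passed in directly. A read with several entries, like read2 above, simply gets a list with more than one element.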

One will need to search some databases with the "nasty" sequences and find vector sequences, adaptor sequences or even human sequences in bacterial or plant genomes ... or vice versa as this type of contamination happens quite easily with data from new sequencing technologies. After a while one gets a feeling what constitutes the largest part of the problem and one can start to think of taking countermeasures like filtering, clipping, masking etc.

4.  Pipeline to find worst contaminants or repeats in sequencing data

Note

In case you are not familiar with UNIX pipes, now would be a good time to read an introductory text on how this wonderful system works. You might want to start with a short introductory article at Wikipedia: http://en.wikipedia.org/wiki/Pipeline_%28Unix%29

In a nutshell: instead of output to files, a pipe directs the output of one program as input to another program.

There's one very simple trick I use to find out whether data contains some kind of sequencing vector or adaptor contamination. It makes use of the readrepeats file discussed above.

The following example demonstrates this on a 454 data set where the sequencing provider used some special adaptor in the wet lab but somehow forgot to tell the Roche pre-processing software about it, so that a very large fraction of reads in the SFF file had unclipped adaptor sequence in it (which of course wreaks havoc with assembly programs):

arcadia:$ grep MNRr badproject_info_readrepeats.lst | cut -f 3| sort | uniq -c |sort -g -r | head -15
    504 ACCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    501 CAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    489 GGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    483 GCCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    475 AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    442 GATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    429 CGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    424 TTGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    393 ACTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    379 CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    363 ATTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    343 CATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    334 GTTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    328 AACACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    324 GGTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC

You probably see a sequence pattern CTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC in the above output. Before going into details of what you are actually seeing, here's the explanation of how this pipeline works:

grep MNRr badproject_info_readrepeats.lst

From the file with the information on repeats, grab all the lines containing repetitive sequence which MIRA categorised as 'nasty' via the 'MNRr' tag. The result looks a bit like this (first 15 lines shown):

C6E3C7T12GKN35  MNRr    GCGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12JLIBM  MNRr    TTCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12HQOM1  MNRr    CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12G52II  MNRr    CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12JRMPO  MNRr    TCTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12H1A8V  MNRr    GCGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12H34Z7  MNRr    AAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12H4HGC  MNRr    GGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12FNA1N  MNRr    AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12F074V  MNRr    CTTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12I1GYO  MNRr    CAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12I53C8  MNRr    CACACTCGTATAGTGACACGCAACAGGGG
C6E3C7T12I4V6V  MNRr    ATCACTCGTATAGTGACACGCAACAGGGG
C6E3C7T12H5R00  MNRr    TCTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
C6E3C7T12IBA5E  MNRr    AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
...

cut -f 3

We're just interested in the sequence now, which is in the third column. The above 'cut' command takes care of this. The resulting output may look like this (only first 15 lines shown):

GCGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
TTCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
TCTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
GCGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
AAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
GGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
CTTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
CAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
CACACTCGTATAGTGACACGCAACAGGGG
ATCACTCGTATAGTGACACGCAACAGGGG
TCTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
...

sort

Simply sort all sequences. The output may look like this now (only first 15 lines shown):

AAACTCGTATAGTGACACGCA
AAACTCGTATAGTGACACGCAACAGG
AAACTCGTATAGTGACACGCAACAGGG
AAACTCGTATAGTGACACGCAACAGGGG
AAACTCGTATAGTGACACGCAACAGGGG
AAACTCGTATAGTGACACGCAACAGGGG
AAACTCGTATAGTGACACGCAACAGGGG
AAACTCGTATAGTGACACGCAACAGGGG
AAACTCGTATAGTGACACGCAACAGGGGAT
AAACTCGTATAGTGACACGCAACAGGGGATA
AAACTCGTATAGTGACACGCAACAGGGGATA
AAACTCGTATAGTGACACGCAACAGGGGATA
AAACTCGTATAGTGACACGCAACAGGGGATA
AAACTCGTATAGTGACACGCAACAGGGGATA
AAACTCGTATAGTGACACGCAACAGGGGATA
...

uniq -c

This command counts how often a line repeats itself in a file. As we previously sorted the whole file by sequence, it effectively counts how often a certain sequence has been tagged as MNRr. The output consists of two whitespace-separated columns: the first column contains the number of times a given line (sequence in our case) was seen, the second column contains the line (sequence) itself. Example output looks like this (only first 15 lines shown):

      1 AAACTCGTATAGTGACACGCA
      1 AAACTCGTATAGTGACACGCAACAGG
      1 AAACTCGTATAGTGACACGCAACAGGG
      5 AAACTCGTATAGTGACACGCAACAGGGG
      1 AAACTCGTATAGTGACACGCAACAGGGGAT
     13 AAACTCGTATAGTGACACGCAACAGGGGATA
      6 AAACTCGTATAGTGACACGCAACAGGGGATAGAC
      4 AAACTCGTATAGTGACACGCAACAGGGGATAGACAA
      9 AAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGC
      3 AAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCA
    257 AAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
      1 AACACTCGTATAGTGACACGCAAC
      2 AACACTCGTATAGTGACACGCAACAGGG
     23 AACACTCGTATAGTGACACGCAACAGGGG
      6 AACACTCGTATAGTGACACGCAACAGGGGATA
...

sort -g -r

We now sort the output of the previous uniq-counting command by asking 'sort' to perform a numerical sort (via '-g') and additionally sort in reverse order (via '-r') so that we get the sequences encountered most often at the top of the output. The result looks exactly as shown previously:

    504 ACCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    501 CAACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    489 GGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    483 GCCACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    475 AATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    442 GATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    429 CGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    424 TTGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    393 ACTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    379 CAGACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    363 ATTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    343 CATACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    334 GTTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    328 AACACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
    324 GGTACTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC
...

So, what is this ominous CTCGTATAGTGACACGCAACAGGGGATAGACAAGGCAC you are seeing? To make it short: a modified 454 B-adaptor with an additional MID sequence.

Note

These adaptor sequences have absolutely no reason to exist in your data, none! Go back to your sequencing provider and ask them to have a look at their pipeline, as they should have had it set up in a way that you do not see these things anymore. Yes, due to sequencing errors, sometimes some adaptor or sequencing vector remnants will stay in your sequencing data, but that is no problem as MIRA is capable of handling that very well.

But having much more than 0.1% to 0.5% of your sequence containing these is a sure sign that someone goofed somewhere ... and it's very probably not your fault.
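For those who prefer a script over a shell pipeline, the same counting can be sketched in Python. This is an illustrative equivalent only: the function name top_nasty_sequences is made up, the columns are assumed to be tab-separated (as 'cut -f 3' assumes, too), and sequences with identical counts may come out in a different order than with 'sort -g -r'.

```python
from collections import Counter


def top_nasty_sequences(lines, limit=15):
    """Count MNRr-tagged sequences and return the most frequent ones.

    Mirrors: grep MNRr | cut -f 3 | sort | uniq -c | sort -g -r | head -15
    """
    counts = Counter()
    for line in lines:
        if "MNRr" not in line:                  # grep MNRr
            continue
        fields = line.rstrip("\n").split("\t")  # cut -f 3
        if len(fields) >= 3:
            counts[fields[2]] += 1              # sort | uniq -c
    return counts.most_common(limit)            # sort -g -r | head
```

The function takes any iterable of lines, so it can be fed an open file handle of the readrepeats file and returns (sequence, count) pairs with the most frequent sequence first.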

5.  Examples for hash statistics

Selecting the right ratio so that an assembly fits into your memory is not straightforward. But MIRA can help you a bit: during assembly, some frequency statistics are printed out (they'll probably end up in some info file in later releases). Search for the term "Hash statistics" in the information printed out by MIRA (this happens quite early in the process).

5.1.  Caveat: -SK:bph

Some explanation how bph affects the statistics and why it should be chosen >=16 for [-SK:mnr]

5.2.  Sanger sequencing, a simple bacterium

This example is taken from a pretty standard bacterium where Sanger sequencing was used:

Hash statistics:
=========================================================
Measured avg. coverage: 15

Deduced thresholds:
-------------------
Min normal cov: 7
Max normal cov: 23
Repeat cov: 29
Crazy cov: 120
Mask cov: 150

Repeat ratio histogram:
-----------------------
0       475191
1       5832419
2       181994
3       6052
4       4454
5       972
6       4
7       8
14      2
16      10
=========================================================
      

The above can be interpreted like this: the expected coverage of the genome is 15x. Starting with an estimated hash frequency of 29, MIRA will treat a k-mer as 'repetitive'. As shown in the histogram, the overall picture of this project is pretty healthy:

  • only a small fraction of k-mers have a repeat level of '0' (these would be k-mers in regions with quite low coverage or k-mers containing sequencing errors)

  • the vast majority of k-mers have a repeat level of 1 (so that's non-repetitive coverage)

  • there is a small fraction of k-mers with repeat level of 2-10

  • there are almost no k-mers with a repeat level >10
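If you want to look at these numbers in a script, the block can be parsed with a few lines of Python. The sketch below is based only on the layout of the example output shown above; the function name parse_hash_statistics is made up, and other MIRA versions may format the block differently.

```python
def parse_hash_statistics(text):
    """Parse a 'Hash statistics' block into (thresholds, histogram).

    thresholds maps labels such as 'Repeat cov' to integers,
    histogram maps repeat level to the number printed next to it.
    """
    thresholds = {}
    histogram = {}
    in_histogram = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Repeat ratio histogram"):
            in_histogram = True
        elif ":" in line and not in_histogram:
            # Lines like 'Repeat cov: 29'; headings have no number after ':'
            key, _, value = line.partition(":")
            if value.strip().isdigit():
                thresholds[key.strip()] = int(value)
        elif in_histogram:
            # Histogram lines consist of exactly two integers
            fields = line.split()
            if len(fields) == 2 and fields[0].isdigit() and fields[1].isdigit():
                histogram[int(fields[0])] = int(fields[1])
    return thresholds, histogram
```

Separator lines and headings are ignored, so the whole block can be pasted in as it appears in the MIRA output.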

5.3.  454 Sequencing, a somewhat more complex bacterium

Here's in comparison a profile for a more complicated bacterium (454 sequencing):

Hash statistics:
=========================================================
Measured avg. coverage: 20

Deduced thresholds:
-------------------
Min normal cov: 10
Max normal cov: 30
Repeat cov: 38
Crazy cov: 160
Mask cov: 0

Repeat ratio histogram:
-----------------------
0       8292273
1       6178063
2       692642
3       55390
4       10471
5       6326
6       5568
7       3850
8       2472
9       708
10      464
11      270
12      140
13      136
14      116
15      64
16      54
17      54
18      52
19      50
20      58
21      36
22      40
23      26
24      46
25      42
26      44
27      32
28      38
29      44
30      42
31      62
32      116
33      76
34      80
35      82
36      142
37      100
38      120
39      94
40      196
41      172
42      228
43      226
44      214
45      164
46      168
47      122
48      116
49      98
50      38
51      56
52      22
53      14
54      8
55      2
56      2
57      4
87      2
89      6
90      2
92      2
93      2
1177    2
1181    2
=========================================================
      

The difference to the first bacterium shown is pretty striking:

  • first, the number of k-mers in repeat level 0 (below average) is higher than the number of k-mers in level 1! This points to a higher number of sequencing errors in the 454 reads than in the Sanger project shown previously, or to a more uneven distribution of reads (but not in this special case).

  • second, the repeat level histogram does not trail off at a repeat frequency of 10 or 15, but it has a long tail up to the fifties, even having a local maximum at 42. This points to a small part of the genome being heavily repetitive ... or to (a) plasmid(s) in high copy numbers.

Should MIRA ever have problems with this genome, switch on the nasty repeat masking and use a level of 15 as cutoff. In this case, 15 is OK to start with as a) it's a bacterium, it can't be that hard and b) the frequencies above level 5 are in the low thousands and not in the tens of thousands.
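To get a feeling for what a given cutoff would hide, one can sum up the histogram from that level upwards. The following Python sketch does this for a few candidate cutoffs. The function names are made up, the histogram is expected as a dictionary mapping repeat level to the number printed next to it, and the sketch assumes that the repeat level in the histogram is directly comparable to the [-SK:nrr] value.

```python
def kmers_at_or_above(histogram, cutoff):
    """Sum the histogram entries with a repeat level >= cutoff."""
    return sum(n for level, n in histogram.items() if level >= cutoff)


def cutoff_table(histogram, cutoffs=(5, 10, 15, 20)):
    """Return (cutoff, affected k-mers, fraction of all k-mers) per cutoff."""
    total = sum(histogram.values())
    rows = []
    for cutoff in cutoffs:
        affected = kmers_at_or_above(histogram, cutoff)
        rows.append((cutoff, affected, affected / total))
    return rows
```

If the number of affected k-mers drops only slowly when raising the cutoff, a long repetitive tail like the one above is present and a lower cutoff may be the safer start.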

5.4.  Solexa sequencing, E.coli MG1655

Hash statistics:
=========================================================
Measured avg. coverage: 23

Deduced thresholds:
-------------------
Min normal cov: 11
Max normal cov: 35
Repeat cov: 44
Crazy cov: 184
Mask cov: 0

Repeat ratio histogram:
-----------------------
0       1365693
1       8627974
2       157220
3       11086
4       4990
5       3512
6       3922
7       4904
8       3100
9       1106
10      868
11      788
12      400
13      186
14      28
15      10
16      12
17      4
18      4
19      2
20      14
21      8
25      2
26      8
27      2
28      4
30      2
31      2
36      4
37      6
39      4
40      2
45      2
46      8
47      14
48      8
49      4
50      2
53      2
56      6
59      4
62      2
63      2
67      2
68      2
70      2
73      4
75      2
77      4
=========================================================
      

These hash statistics show that MG1655 is pretty boring (from a repetitive point of view). One might expect a few repeats but nothing fancy: the repeats are actually the rRNA and sRNA stretches in the genome plus some intergenic regions.

  • the number of k-mers in repeat level 0 (below average) is considerably lower than in level 1, so the Solexa sequencing quality is pretty good and there shouldn't be too many low coverage areas.

  • the histogram tail shows some faint traces of possibly highly repetitive k-mers, but these are false positive matches due to some standard Solexa base-calling weaknesses of earlier pipelines like, e.g., adding poly-A, poly-T or sometimes poly-C and poly-G tails to reads when spots in the images were faint and the base calls of bad quality.

5.5.  (NEED EXAMPLES FOR EUKARYOTES)

5.6.  (NEED EXAMPLES FOR PATHOLOGICAL CASES)

Vector contamination etc.