The MAF format

Bastien Chevreux

MIRA Version 3.4.1.1

Document revision $Id$

Table of Contents

1. Introduction: why an own assembly format?
2. The MAF format
2.1. Basics
2.2. Reads
2.2.1. Simple example
2.2.2. List of records for reads
2.2.3. Interpreting clipping values
2.3. Contigs
2.3.1. Simple example 2
2.3.2. List of records for contigs
 

Design flaws travel in herds.

 
 --Solomon Short

This document describes the purpose and layout of the MAF format, version 1.

1.  Introduction: why an own assembly format?

I had been on the hunt for some time for a file format that allows MIRA to quickly save and load reads and full assemblies. There are currently a number of alignment file formats on the market, and MIRA can read and/or write most of them. Why not take one of these? It turned out that all of them (well, the ones I know: ACE, BAF, CAF, CALF, EXP, FRG) have some kind of no-go 'feature' (or problem or bug) that makes one's life pretty difficult if one wants to write or parse that given file format.

What I needed for MIRA was a format that:

  1. is easy to parse

  2. is quick to parse

  3. contains all the information about an assembly that MIRA and many finishing programs use: reads (with sequence and qualities), contigs, tags, etc.

MAF is not the format with the smallest possible footprint (though it fares quite well in comparison to ACE, CAF and EXP), but as it is meant as an interchange format, it will do. It can be easily indexed and does not need string lookups during parsing.

I took the liberty to combine many good ideas from EXP, BAF, CAF and FASTQ while defining the format and if anything is badly designed, it's all my fault.

2.  The MAF format

This describes version 1 of the MAF format. If the need arises, enhancements like metadata about total number of contigs and reads will be implemented in the next version.

2.1.  Basics

MAF ...

  1. ... has for each record a keyword at the beginning of the line, followed by exactly one blank (a space or a tab), followed by the values for this record. At the moment, all keywords are two characters long, but keywords of other lengths might appear in the future.

  2. ... is strictly line oriented. Each record is terminated by a newline; no record spans multiple lines.

All coordinates start at 1, i.e., there is no 0 value for coordinates.
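
The two rules above can be sketched in a few lines of Python. This is only an illustration, not code taken from MIRA: the function name `split_record` is hypothetical, and the sketch assumes that exactly one space or tab separates keyword and values, as stated above. It deliberately does not rely on keywords being two characters long.

```python
def split_record(line):
    """Split one MAF line into (keyword, value).

    Hypothetical helper: assumes a single blank (space or tab)
    separates the keyword from the values of the record.
    """
    line = line.rstrip("\r\n")
    for index, char in enumerate(line):
        if char in " \t":
            # Everything after the first blank belongs to the value.
            return line[:index], line[index + 1:]
    # Records such as ER consist of the keyword only.
    return line, ""


print(split_record("RD\tU13a05e07.t1\n"))
print(split_record("ER\n"))
```

Splitting at the first blank only keeps free-text values, such as tag comments, intact.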

2.2.  Reads

2.2.1.  Simple example

Here's an example for a simple read, just the read name and the sequence:

	  RD      U13a05e07.t1
	  RS      CTTGCATGCCTGCAGGTCGACTCTAGAAGGACCCCGATCA
	  ER
	

Reads start with RD and end with ER. The RD keyword is always followed by the name of the read, while ER stands on its own. Reads should also contain a sequence (RS). Everything else is optional. In the following example, the read has additional quality values (RQ), template definitions (name in TN, direction in DI, minimum and maximum insert size in TF and TT), a pointer to the file with the raw data (SF), a left clip which covers sequencing vector or adaptor sequence (SL), a left clip covering low quality (QL), a right clip covering low quality (QR), a right clip covering sequencing vector or adaptor sequence (SR), alignment to original sequence (AO), a tag (RT) and the sequencing technology it was generated with (ST).

	  RD      U13a05e07.t1
	  RS      CTTGCATGCCTGCAGGTCGACTCTAGAAGGACCCCGATCA
	  RQ      ,-+*,1-+/,36;:6<=3327<7A1/,,).('..7=@E8:
	  TN      U13a05e07
	  DI      F
	  TF      1200
	  TT      1800
	  SF      U13a05e07.t1.scf
	  SL      4
	  QL      7
	  QR      30
	  SR      32
	  AO      1 40 1 40
	  RT      ALUS 10 15 Some comment to this read tag.
	  ST      Sanger
	  ER
	

2.2.2.  List of records for reads

  • RD string: readname

    RD followed by the read name starts a read.

  • LR integer: read length

    The length of the read can optionally be given in LR. This is meant to help the parser perform sanity checks and possibly pre-allocate memory for sequence and quality.

    MIRA at the moment only writes LR lines for reads with more than 2000 bases.

  • RS string: DNA sequence

    Sequence of a read is stored in RS.

  • RQ string: qualities

    Qualities are stored in FASTQ format, i.e., each quality value + 33 is written as a single ASCII character.

  • SV string: sequencing vector

    Name of the sequencing vector or adaptor used in this read.

  • TN string: template name

    Template name. This defines the DNA template a sequence comes from. In its simplest form, a DNA template is sequenced only once. In paired-end sequencing, a DNA template is sequenced once in forward and once in reverse direction (Sanger, 454, Solexa). In Sanger sequencing, several forward and/or reverse reads can be sequenced from a DNA template. In PacBio sequencing, a DNA template can be sequenced in several "strobes", leading to multiple reads on a DNA template.

  • DI character: F or R

    Direction of the read with respect to the template. F for forward, R for reverse.

  • TF integer: template size from

    Minimum estimated size of a sequencing template. In paired-end sequencing, this is the minimum distance of the read pair.

  • TT integer: template size to

    Maximum estimated size of a sequencing template. In paired-end sequencing, this is the maximum distance of the read pair.

  • SF string: sequencing file

    Name of the sequencing file which contains raw data for this read.

  • SL integer: seqvec left

    Clip left due to sequencing vector. Assumed to be 1 if not present. Note that left clip values are excluding, e.g.: a value of '7' clips off the left 6 bases.

  • QL integer: qual left

    Clip left due to low quality. Assumed to be 1 if not present. Note that left clip values are excluding, e.g.: a value of '7' clips off the left 6 bases.

  • CL integer: clip left

    Clip left (any reason). Assumed to be 1 if not present. Note that left clip values are excluding, e.g.: a value of '7' clips off the left 6 bases.

  • SR integer: seqvec right

    Clip right due to sequencing vector. Assumed to be the length of the sequence if not present. Note that right clip values are including, e.g., a value of '10' leaves the bases 1 to 9 and clips at and including base 10 and higher.

  • QR integer: qual right

    Clip right due to low quality. Assumed to be the length of the sequence if not present. Note that right clip values are including, e.g., a value of '10' leaves the bases 1 to 9 and clips at and including base 10 and higher.

  • CR integer: clip right

    Clip right (any reason). Assumed to be the length of the sequence if not present. Note that right clip values are including, e.g., a value of '10' leaves the bases 1 to 9 and clips at and including base 10 and higher.

  • AO four integers: x1 y1 x2 y2

    AO stands for "Align to Original". The interval [x1 y1] in the read as stored in the MAF file aligns with [x2 y2] in the original, unedited read sequence. This makes it possible to model insertions and deletions in the read and still find the correct position in the original, base-called sequence data.

    A read can have several AO lines which together define all the edits performed to this read.

    Assumed to be "1 x 1 x" if not present, where 'x' is the length of the unclipped sequence.

  • RT string + 2 integers + optional string: type x1 y1 comment

    Read tags are given by naming the tag type, which positions in the read the tag spans in the interval [x1 y1] and afterwards optionally a comment. As MAF is strictly line oriented, newline characters in the comment are encoded as \n.

    If x1 > y1, the tag is in reverse direction.

    The tag type can be a free form string, though MIRA will recognise and work with tag types used by the Staden gap4 package (and of course the MIRA tags as described in the main documentation of MIRA).

  • ST string: sequencing technology

    The following technologies can currently be defined: Sanger, 454, Solexa, SOLiD.

  • SN string: strain name

    Strain name of the sample that was sequenced. This is a free-form string.

  • MT string: machine type

    Machine type which generated the data. This is a free-form string.

  • BC string: base caller

    Base calling program used to call bases.

  • IB boolean (0 or 1): is backbone

    Whether the read is a backbone. Reads used as reference (backbones) in mapping assemblies get this attribute.

  • IC boolean (0 or 1)

    Whether the read is a coverage equivalent read (e.g. from mapping Solexa). This is internal to MIRA.

  • IR boolean (0 or 1)

    Whether the read is a rail. This also is internal to MIRA.

  • ER

    This ends a read and is mandatory.
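
As a rough sketch of how the records listed above could be consumed, the following Python code collects one read into a dictionary, decodes RQ into integer quality values and decodes the `\n` escape in RT comments. The names `decode_qualities` and `parse_read` are hypothetical and not part of MIRA. The sketch assumes a tab between keyword and value and single spaces between the RT fields; it also assumes that RT, like AO, may occur several times per read. The input lines are a shortened version of the example read.

```python
def decode_qualities(rq):
    """Turn an RQ string into integer quality values (character code - 33)."""
    return [ord(char) - 33 for char in rq]


def parse_read(lines):
    """Collect the records of one read (RD ... ER) into a dictionary.

    AO and RT are gathered in lists; every other keyword keeps its raw
    string value. Assumes a tab between keyword and value.
    """
    read = {"AO": [], "RT": []}
    for line in lines:
        keyword, _, value = line.rstrip("\r\n").partition("\t")
        if keyword == "ER":
            return read
        if keyword == "AO":
            read["AO"].append(tuple(int(x) for x in value.split()))
        elif keyword == "RT":
            # type x1 y1 [comment]; the comment may contain blanks.
            fields = value.split(" ", 3)
            comment = fields[3].replace("\\n", "\n") if len(fields) > 3 else ""
            read["RT"].append((fields[0], int(fields[1]), int(fields[2]), comment))
        else:
            read[keyword] = value
    raise ValueError("read is not terminated by ER")


# Shortened illustration data derived from the example read.
example = [
    "RD\tU13a05e07.t1",
    "RS\tCTTGCATGCC",
    "RQ\t,-+*,1-+/,",
    "RT\tALUS 2 5 first line\\nsecond line",
    "ER",
]
read = parse_read(example)
print(read["RD"], decode_qualities(read["RQ"]))
```

Because ER is mandatory, the sketch treats a read that runs out of lines before ER as an error.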

2.2.3.  Interpreting clipping values

Every left and right clipping pair (SL & SR, QL & QR, CL & CR) forms a clear range in the interval [left right[ in the sequence of a read. For example, a read with SL=4 and SR=10 has the bases 1,2,3 clipped away on the left side, the bases 4,5,6,7,8,9 as clear range and the bases 10 and following clipped away on the right side.

The left clip of a read is determined as max(SL,QL,CL) (the rightmost left clip) whereas the right clip is min(SR,QR,CR).
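
The rule can be illustrated with a small Python sketch. The function names `clear_range` and `clipped_sequence` are hypothetical; the sketch takes the description above literally, i.e., a missing left clip counts as 1, a missing right clip counts as the sequence length, and the clear range is the half-open, 1-based interval [left right[.

```python
def clear_range(seq_length, left_clips=(), right_clips=()):
    """Return (left, right) of the clear range [left right[, 1-based.

    left_clips holds the SL/QL/CL values that are present,
    right_clips holds the SR/QR/CR values that are present.
    """
    left = max([1, *left_clips])
    right = min([seq_length, *right_clips])
    return left, right


def clipped_sequence(sequence, left, right):
    """Return the bases left .. right-1 (1-based) of the sequence."""
    return sequence[left - 1:right - 1]


# Clip values of the example read: SL=4, QL=7, QR=30, SR=32, 40 bases.
print(clear_range(40, left_clips=[4, 7], right_clips=[32, 30]))

# SL=4 and SR=10 keep the bases 4 to 9.
print(clipped_sequence("CTTGCATGCCTG", 4, 10))
```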

2.3.  Contigs

Contigs are not much more than containers containing reads with some additional information. Contrary to CAF or ACE, MAF does not first store all reads in single containers and then define the contigs. In MAF, contigs are defined as outer container and within those, the reads are stored like normal reads.

2.3.1.  Simple example 2

The above example for a read can be encased in a contig like this (with two consensus tags gratuitously added in):

	  CO      contigname_s1
	  NR      1
	  LC      24
	  CS      TGCCTGCAGGTCGACTCTAGAAGG
	  CQ      -+/,36;:6<=3327<7A1/,,).
	  CT      COMM 5 8 Some comment to this consensus tag.
	  CT      COMM 7 12 Another comment to this consensus tag.
	  \\
	  RD      U13a05e07.t1
	  RS      CTTGCATGCCTGCAGGTCGACTCTAGAAGGACCCCGATCA
	  RQ      ,-+*,1-+/,36;:6<=3327<7A1/,,).('..7=@E8:
	  TN      U13a05e07
	  TF      1200
	  TT      1800
	  SF      U13a05e07.t1.scf
	  SL      4
	  SR      32
	  QL      7
	  QR      30
	  AO      1 40 1 40
	  RT      ALUS 10 15 Some comment to this read tag.
	  ST      Sanger
	  ER
	  AT      1 24 7 30
	  //
	  EC
	

Note that the read shown previously (and now encased in a contig) is absolutely unchanged. It has just been complemented with a bit of data which describes the contig as well as with a one liner which places the read into the contig.
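
To illustrate what the AT line in the example expresses, the following Python sketch converts a read position into a contig position. The function name `read_to_contig` is hypothetical. The sketch assumes that both intervals of the AT line have the same length and that positions map linearly within them, which this document does not state explicitly.

```python
def read_to_contig(read_pos, x1, y1, x2, y2):
    """Map a 1-based read position inside [x2 y2] to a contig position.

    x1 y1 x2 y2 are the four integers of an AT line. Assumes a linear,
    gap-free mapping between the two intervals.
    """
    if not x2 <= read_pos <= y2:
        raise ValueError("position lies outside the aligned interval")
    offset = read_pos - x2
    # x1 > y1 means the read is placed as its reverse complement.
    return x1 + offset if x1 <= y1 else x1 - offset


# AT 1 24 7 30 from the example above.
print(read_to_contig(7, 1, 24, 7, 30))
print(read_to_contig(30, 1, 24, 7, 30))
```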

2.3.2.  List of records for contigs

  • CO string: contig name

    CO starts a contig. The contig name that follows is mandatory but can be any string, including numbers.

  • NR integer: num reads in contig

    This is optional but highly recommended.

  • LC integer: contig length

    Note that this length defines the length of the 'clear range' of the consensus. It is exactly equal to the length of the CS (sequence) and CQ (quality) strings below.

  • CT string + 2 integers + optional string: identifier x1 y1 comment

    Consensus tags are defined like read tags but apply to the consensus. Here too, the interval [x1 y1] is inclusive and if x1 > y1, the tag is in reverse direction.

  • CS string: consensus sequence

    Sequence of a consensus is stored in CS.

  • CQ string: qualities

    Consensus qualities are stored in FASTQ format, i.e., each quality value + 33 is written as a single ASCII character.

  • \\

    This marks the start of read data of this contig. After this, all reads are stored one after the other, just separated by an "AT" line (see below).

  • AT Four integers: x1 y1 x2 y2

    The AT (Assemble_To) line defines the placement of the read in the contig and immediately follows the closing "ER" of a read so that parsers do not need to perform time-consuming string lookups. Every read in a contig has exactly one AT line.

    The interval [x2 y2] of the read (i.e., the unclipped data, also called the 'clear range') aligns with the interval [x1 y1] of the contig. If x1 > y1 (the contig positions), then the reverse complement of the read is aligned to the contig. For the read positions, x2 is always < y2.

  • //

    This marks the end of read data.

  • EC

    This ends a contig and is mandatory.
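
Putting the contig records together, a parser can walk through a contig in a single pass without any name lookups, because each AT line directly follows the read it places. The Python sketch below illustrates this. The function name `parse_contig` and the layout of the returned dictionary are hypothetical, and a tab between keyword and value is assumed. The input is a shortened version of the example contig.

```python
def parse_contig(lines):
    """Parse one contig (CO ... EC) from an iterable of MAF lines."""
    contig = {"CT": [], "reads": []}
    in_reads = False
    read = None
    for line in lines:
        keyword, _, value = line.rstrip("\r\n").partition("\t")
        if keyword == "EC":
            return contig
        if keyword == "\\\\":
            in_reads = True            # start of read data
        elif keyword == "//":
            in_reads = False           # end of read data
        elif not in_reads:
            if keyword == "CT":
                contig["CT"].append(value)
            else:
                contig[keyword] = value
        elif keyword == "RD":
            read = {"name": value, "records": [], "AT": None}
        elif keyword == "ER":
            contig["reads"].append(read)
        elif keyword == "AT":
            # AT belongs to the read that was closed just before it.
            contig["reads"][-1]["AT"] = tuple(int(x) for x in value.split())
        else:
            read["records"].append((keyword, value))
    raise ValueError("contig is not terminated by EC")


# Shortened illustration data derived from the example contig.
example = [
    "CO\tcontigname_s1",
    "NR\t1",
    "LC\t24",
    "CS\tTGCCTGCAGGTCGACTCTAGAAGG",
    "\\\\",
    "RD\tU13a05e07.t1",
    "RS\tCTTGCATGCCTGCAGGTCGACTCTAGAAGGACCCCGATCA",
    "ER",
    "AT\t1 24 7 30",
    "//",
    "EC",
]
contig = parse_contig(example)
print(contig["CO"], len(contig["reads"]), contig["reads"][0]["AT"])
```

The NR and LC values can then serve as sanity checks against the number of parsed reads and the length of the CS string.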