Utilities in the MIRA package

Bastien Chevreux

MIRA Version 3.4.1.1

Document revision $Id$

Table of Contents

1. convert_project
1.1. Synopsis
1.2. Description
1.3. Options
1.3.1. General options
1.3.2. Options for input containing contig data
1.4. Examples
2. mirabait
2.1. Synopsis
2.2. Description
2.3. Options
 

Ninety percent of success is just growing up.

 
 --Solomon Short

1. convert_project

1.1.  Synopsis

convert_project [options] {input_file} {output_basename}

1.2. Description

convert_project is a tool to convert, extract and sometimes recalculate all kinds of data related to sequence assembly files.

More specifically, convert_project can

  1. convert from multiple alignment files (CAF, MAF) to other multiple alignment files (CAF, MAF, ACE, SAM), optionally selecting contigs by different criteria like name, length, coverage etc.

  2. extract the consensus from multiple alignments in CAF and MAF format, writing it to any supported output format (FASTA, FASTQ, plain text, HTML, etc.), optionally recalculating the consensus using the MIRA consensus engine with MIRA parameters

  3. extract read sequences (clipped or unclipped) from multiple alignments and save to any supported format

  4. Much more, need to document this.

1.3. Options

1.3.1. General options

-f { caf | maf | fasta | fastq | gbf | phd | fofnexp }

From-type, the format of the input file. CAF and MAF files can contain full assemblies and/or unassembled (single) sequences while the other formats contain only unassembled sequences.

-t { ace | asnp | caf | crlist | cstats | exp | fasta | fastq | gbf | hsnp | html | maf | phd | text | tcs | wig } [ -t … ]

To-type, the format of the output file. Multiple mentions of [-t] are allowed, in which case convert_project will convert to multiple types.

-a

Append. Results of conversion are appended to existing files instead of overwriting them.

-A MIRA-PARAMETERSTRING

Additional MIRA parameters. Allows the underlying MIRA routines to be initialised with specific parameters. One use case is recalculating the consensus of an assembly in a slightly different way (see also [-r]) than the one stored in the assembly files. Example: to tell the consensus algorithm to use a minimum number of reads per group, use: "454_SETTINGS -CO:mrpg=4".

Consult the MIRA reference manual for a full list of MIRA parameters.

-C

Hard clip reads. When the input format contains clipping points in sequences and the requested output consists of read sequences, only the parts of the sequences that are not clipped away will be saved as results.
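
As a rough illustration of what hard clipping means for a single read, the following Python sketch cuts a sequence down to the region between its clip points. The function name hard_clip and the 0-based, half-open left/right coordinates are assumptions made for this example; they are not taken from MIRA or from any of the supported file formats.

```python
def hard_clip(sequence, left, right):
    """Return only the part of `sequence` between the clip points.

    Hypothetical helper: `left` and `right` are 0-based, half-open
    coordinates chosen for this sketch, not MIRA's internal convention.
    """
    if not 0 <= left <= right <= len(sequence):
        raise ValueError("clip points must lie within the sequence")
    return sequence[left:right]


# Made-up read: lower-case stretches stand in for clipped-away ends.
read = "nnnnACGTACGTxxxx"
clipped = hard_clip(read, 4, 12)
print(clipped)
```

Without hard clipping, the full sequence would be written and the clip points would merely be kept as annotation where the output format supports it.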

-m

Make contigs. Encase single reads as contig singlets into a CAF/MAF file.

-n namefile

Name select. Only contigs or reads whose name appears in namefile are selected for output. namefile is a simple text file having one name entry per line.

-o offset

Offset of quality values in FASTQ files. Only valid if -f is FASTQ.
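
In FASTQ files, each quality value is stored as a single ASCII character, and the offset is the number subtracted from the character code to recover the value. Offsets of 33 and 64 are the two encodings commonly found in practice. The sketch below only illustrates this arithmetic; decode_qualities is a name made up for this example and is not part of convert_project.

```python
def decode_qualities(quality_string, offset):
    """Turn a FASTQ quality string into a list of integer quality values."""
    values = [ord(ch) - offset for ch in quality_string]
    if any(v < 0 for v in values):
        # A negative value usually means the wrong offset was assumed.
        raise ValueError("negative quality value: offset is probably too high")
    return values


# The same four quality values, written with offset 33 and with offset 64.
with_offset_33 = decode_qualities("II5+", 33)
with_offset_64 = decode_qualities("hhTJ", 64)
print(with_offset_33, with_offset_64)
```

Reading a file with the wrong offset shifts every quality value by the difference between the two offsets, which is why the offset needs to match the way the file was written.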

-R namestring

Rename contigs/singlets/reads with given name string to which a counter is added.

Known bug: will create duplicate names if the input (CAF or MAF) contains contigs/singlets as well as free reads, i.e. reads that are neither in contigs nor in singlets.

1.3.2. Options for input containing contig data

The following switches work only if the input file contains contigs (i.e., CAF or MAF with contig data). Note that, although this is infrequent, both CAF and MAF files can contain single reads only.

-M

Do not extract contigs (or their consensus), but the sequence of the reads they are composed of.

-N namefile

Name select, sorted. Only contigs whose name appears in namefile are selected for output. Regardless of the order of contigs in the input, the output is sorted according to the order in which the names appear in namefile. namefile is a simple text file having one name entry per line.

Note that for this function to work, all contigs are loaded into memory, which may strain your RAM for larger projects.
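
The behaviour of -N can be pictured with a small Python sketch: all contigs are first collected in a lookup table (hence the memory requirement), then written out in the order given by the name list. The functions read_namefile and select_sorted, the contig name bchoc_c20 and the placeholder contig data are all invented for this illustration. Silently skipping names that are missing from the input is an assumption of the sketch, not documented behaviour.

```python
def read_namefile(text):
    """Parse name file content: one name per line, blank lines ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def select_sorted(contigs, names):
    """Return (name, data) pairs in the order given by `names`.

    `contigs` is an iterable of (name, data) pairs in input order.
    Everything is held in a dict first, which mirrors why this kind
    of selection needs all contigs in memory at once.
    """
    by_name = dict(contigs)
    # Assumption of this sketch: names absent from the input are skipped.
    return [(name, by_name[name]) for name in names if name in by_name]


# Made-up assembly content, listed in input order.
assembly = [
    ("bchoc_c3", "contig 3 data"),
    ("bchoc_c5", "contig 5 data"),
    ("bchoc_c13", "contig 13 data"),
    ("bchoc_c14", "contig 14 data"),
    ("bchoc_c20", "contig 20 data"),
]
wanted = read_namefile("bchoc_c14\nbchoc_c3\nbchoc_c5\nbchoc_c13\n")
result = select_sorted(assembly, wanted)
print([name for name, _ in result])
```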

-r { c | C | q | f }

Recalculate consensus and / or consensus quality values and / or SNP feature tags of an assembly. This feature is useful when third party programs create their own consensus sequences without handling different sequencing technologies (e.g. the combination of gap4 and caf2gap) or when the CAF/MAF files do not contain consensus sequences at all.

  c    recalculate consensus & consensus qualities using IUPAC where necessary
  C    recalculate consensus & consensus qualities forcing ACGT calls and without IUPAC codes
  q    recalculate consensus quality values only
  f    recalculate SNP features

Note: only the last of c, C and q is relevant; 'f' works as a switch and can be combined with the others (e.g. -r Cf).

Note: if the CAF/MAF contains reads from multiple strains, recalculation of consensus & consensus qualities is forced; you can only influence whether IUPAC codes are used or not. This is because CAF/MAF do not provide facilities to store consensus sequences from multiple strains.

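
How the letters given to -r combine can be sketched in a few lines of Python. This is only a reading of the rule stated above (the last of c, C and q wins, f is an independent switch); parse_recalc_modes is a hypothetical name, and nothing here is taken from the convert_project source.

```python
def parse_recalc_modes(modes):
    """Interpret a -r argument such as "Cf".

    Returns (consensus_mode, recalc_features), where consensus_mode is
    'c', 'C', 'q' or None, and recalc_features is True if 'f' was given.
    """
    consensus_mode = None
    recalc_features = False
    for ch in modes:
        if ch in ("c", "C", "q"):
            consensus_mode = ch  # a later letter overrides an earlier one
        elif ch == "f":
            recalc_features = True  # independent of the consensus mode
        else:
            raise ValueError(f"unknown recalculation mode: {ch!r}")
    return consensus_mode, recalc_features


print(parse_recalc_modes("Cf"))
```
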
-s

Split. Split output into single files, one file per contig. Files are named after the contig they contain.

-u

fillUp strain genomes. In assemblies made of multiple strains, holes in the consensus of a strain (bases 'N' or '@') can be filled up with the consensus of the other strains. Takes effect only when '-r' is active.

-q quality_value

Defines the minimum quality a consensus base of a strain must have; consensus bases with a quality below this will be set to 'N'. Only used when -r is active.

-v coverage_value

Defines the minimum coverage a consensus base of a strain must have; consensus bases with a coverage below this will be set to 'N'. Only used when -r is active.
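
The combined effect of -q and -v on a recalculated consensus can be sketched as follows. The function mask_consensus and its per-base quality and coverage lists are assumptions made for this illustration; the actual recalculation is done by the MIRA consensus engine and is not reproduced here.

```python
def mask_consensus(bases, qualities, coverages, min_quality=0, min_coverage=0):
    """Replace consensus bases below either threshold with 'N'.

    `bases` is the consensus string; `qualities` and `coverages` hold one
    value per consensus base (hypothetical inputs for this sketch).
    """
    if not len(bases) == len(qualities) == len(coverages):
        raise ValueError("bases, qualities and coverages must have equal length")
    masked = []
    for base, quality, coverage in zip(bases, qualities, coverages):
        if quality < min_quality or coverage < min_coverage:
            masked.append("N")
        else:
            masked.append(base)
    return "".join(masked)


# Made-up consensus of five bases with per-base quality and coverage.
masked = mask_consensus(
    "ACGTA",
    qualities=[40, 10, 35, 30, 50],
    coverages=[12, 12, 2, 8, 1],
    min_quality=30,
    min_coverage=3,
)
print(masked)
```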

-x length

Minimum length a contig (in full assemblies) or read (in single sequence files) must have. All contigs / reads with a length less than this value are discarded. Default: 0 (=switched off).

Note: this is of course not applied to reads in contigs! Contigs passing the [-x] length criterion and stored as a complete assembly (CAF, MAF, ACE, etc.) still contain all their reads.

-X length

Similar to [-x], but applies only to clipped reads (input file format must have clipping points set to be effective).

-y contig_coverage

Minimum average contig coverage. Contigs with an average coverage less than this value are discarded.

-z min_reads

Minimum number of reads in a contig. Contigs with fewer reads than this value are discarded.
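
Taken together, the -x, -y and -z thresholds act as a simple filter on contigs: a contig is kept only if it reaches every threshold that was set. A minimal Python sketch of that logic is shown below; ContigSummary and passes_filters are hypothetical names, and the contig values are invented.

```python
from collections import namedtuple

# Hypothetical per-contig summary; not a MIRA data structure.
ContigSummary = namedtuple("ContigSummary", "name length avg_coverage num_reads")


def passes_filters(contig, min_length=0, min_avg_coverage=0.0, min_reads=0):
    """Keep a contig only if it reaches every threshold (cf. -x, -y, -z)."""
    return (
        contig.length >= min_length
        and contig.avg_coverage >= min_avg_coverage
        and contig.num_reads >= min_reads
    )


contigs = [
    ContigSummary("c1", 5200, 23.5, 410),
    ContigSummary("c2", 1800, 31.0, 150),
    ContigSummary("c3", 2600, 8.2, 60),
]

# Corresponds to thresholds like "-x 2000 -y 10".
kept = [c.name for c in contigs if passes_filters(c, min_length=2000, min_avg_coverage=10)]
print(kept)
```

With all thresholds left at 0, every contig passes, which matches the "switched off" default described for -x.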

-l line_length

On output of assemblies as text or HTML: number of bases shown in one alignment line. Default: 60.

-c endgap_character

On output of assemblies as text or HTML: character used to pad endgaps. Default: ' ' (a blank)

1.4. Examples

In the following examples, the CAF and MAF files used are expected to contain full assembly data like the files created by MIRA during an assembly or by the gap2caf program. CAF and MAF could be used interchangeably in these examples, depending on which format currently is available. In general though, MAF is faster to process and smaller on disk.

Simple conversion: the consensus of an assembly to FASTA, at the same time coverage data for contigs to WIG, and furthermore translate the CAF to ACE:

convert_project -f caf -t fasta -t wig -t ace source.caf dest

Filtering an assembly for contigs of length ≥ 2000 and an average coverage ≥ 10, while translating from MAF to CAF:

convert_project -f maf -t caf -x 2000 -y 10 source.maf dest

Filtering a FASTQ file for reads ≥ 55 base pairs, renaming the selected reads with a string starting with newname and saving them back to FASTQ. Note how [-t fastq] was left out, as the default behaviour of convert_project is to use the same "to" type as the input type ([-f]):

convert_project -f fastq -x 55 -R newname source.fastq dest

Filtering and sorting contigs of an assembly according to an external contig name list.

This example will fetch the contigs named bchoc_c14, ...3, ...5 and ...13 and save the result in exactly that order to a new file:

arcadia:/path/to/myProject$ ls -l
-rw-r--r-- 1 bach users  231698898 2007-10-21 15:16 bchoc_out.caf
-rw-r--r-- 1 bach users         38 2007-10-21 15:16 contigs.lst
arcadia:/path/to/myProject$ cat contigs.lst
bchoc_c14
bchoc_c3
bchoc_c5
bchoc_c13
arcadia:/path/to/myProject$ convert_project -f caf -N contigs.lst bchoc_out.caf myfilteredresult
[...]
arcadia:/path/to/myProject$ ls -l
-rw-r--r-- 1 bach users  231698898 2007-10-21 15:16 bchoc_out.caf
-rw-r--r-- 1 bach users         38 2007-10-21 15:16 contigs.lst
-rw-r--r-- 1 bach users     828726 2007-10-21 15:24 myfilteredresult.caf

2. mirabait

2.1.  Synopsis

mirabait [options] {bait_file} {input_file} {output_basename}

While the input and output files can be in any of the supported formats (see the -f and -t options), the bait file needs to be in FASTA format.

2.2. Description

mirabait selects reads from a read collection which are partly similar or equal to sequences defined as target baits. Similarity is defined as a user-adjustable number of k-mers (sequences of k consecutive bases) shared between the bait sequences and a screened sequence, in either forward or reverse complement direction.

The search performed is exact, that is, sequences selected are guaranteed to have the required number of k-mers equal to the bait sequences, while sequences not selected are guaranteed not to have them.
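
The principle of k-mer baiting can be sketched in Python as shown below. This is a simplified illustration and not mirabait's implementation: all function names are invented, the example sequences are made up, and counting every matching position in a read as one hit is an assumption about how the "number of common k-mers" is tallied. The keyword arguments loosely mirror the -k, -n, -i and -r options described in section 2.3.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case ACGT sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Set of all k-mers in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_bait_index(baits, k, both_strands=True):
    """Collect the k-mers of all baits, optionally for both strands."""
    index = set()
    for bait in baits:
        bait = bait.upper()
        index |= kmers(bait, k)
        if both_strands:
            index |= kmers(reverse_complement(bait), k)
    return index


def count_hits(read, index, k):
    """Number of positions in read whose k-mer occurs in the bait index."""
    read = read.upper()
    return sum(1 for i in range(len(read) - k + 1) if read[i:i + k] in index)


def select_reads(reads, baits, k=31, min_hits=1, inverse=False, both_strands=True):
    """Return names of reads with at least min_hits bait k-mers.

    With inverse=True, return the reads that do NOT meet the criterion.
    """
    index = build_bait_index(baits, k, both_strands)
    selected = []
    for name, seq in reads:
        is_hit = count_hits(seq, index, k) >= min_hits
        if is_hit != inverse:
            selected.append(name)
    return selected


# Made-up bait and reads; k=8 keeps the example short.
bait = "ACGTTGCAAGGCTTAC"
reads = [
    ("fwd", "TTTT" + bait[2:12] + "TTTT"),
    ("rev", "AAAA" + reverse_complement(bait[2:12]) + "AAAA"),
    ("none", "GGGGGGGGGGGGGGGGGG"),
]
both = select_reads(reads, [bait], k=8)
forward_only = select_reads(reads, [bait], k=8, both_strands=False)
inverted = select_reads(reads, [bait], k=8, inverse=True)
print(both, forward_only, inverted)
```

Because membership in the bait index is an exact set lookup, a read is either selected or not with no heuristic in between, which corresponds to the exactness guarantee stated above.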

2.3. Options

-f { caf | maf | fasta | fastq | gbf | phd }

From-type, the format of the input file. Default: fastq.

-t { caf | maf | fasta | fastq }

To-type, the format of the output file. Default: format of the input.

Multiple mentions of -t are allowed, in which case the selected sequences are written to all file formats chosen.

-k k-mer-length

k-mer, length of bait in bases (≤32, default=31)

-n minoccurence

Minimum number of k-mers needed for a sequence to be selected. Default: 1.

-i

Inverse hit: selects only sequences that do not meet the -k and -n criteria.

-r

Does not check for hits in reverse complement direction.