“Ninety percent of success is just growing up.”
--Solomon Short
convert_project is a tool to convert, extract and sometimes recalculate all kinds of data related to sequence assembly files.
More specifically, convert_project can
convert from multiple alignment files (CAF, MAF) to other multiple alignment files (CAF, MAF, ACE, SAM) and, if desired, select contigs by different criteria such as name, length, coverage etc.
extract the consensus from multiple alignments in CAF and MAF format, write it to any supported output format (FASTA, FASTQ, plain text, HTML, etc.) and, if desired, recalculate the consensus using the MIRA consensus engine with MIRA parameters
extract read sequences (clipped or unclipped) from multiple alignments and save to any supported format
Much more, need to document this.
…
…
-f { caf | maf | fasta | fastq | gbf | phd | fofnexp }
“From-type”, the format of the input file. CAF and MAF files can contain full assemblies and/or unassembled (single) sequences while the other formats contain only unassembled sequences.
-t { ace | asnp | caf | crlist | cstats | exp | fasta | fastq | gbf | hsnp | html | maf | phd | text | tcs | wig } [ -t … ]
“To-type”, the format of the output file. Multiple mentions of [-t] are allowed, in which case convert_project will convert to multiple types.
-a
Append. Results of conversion are appended to existing files instead of overwriting them.
-A MIRA-PARAMETERSTRING
Additional MIRA parameters. Allows the underlying MIRA routines to be initialised with specific parameters. One use case is to recalculate the consensus of an assembly in a slightly different way (see also [-r]) than the one stored in the assembly files. Example: to tell the consensus algorithm to use a minimum number of reads per group, use: "454_SETTINGS -CO:mrpg=4".
Consult the MIRA reference manual for a full list of MIRA parameters.
-C
Hard clip reads. When the input format contains clipping points in sequences and the requested output consists of read sequences, only the unclipped parts of the sequences are saved.
-m
Make contigs. Encase single reads as contig singlets into a CAF/MAF file.
-n namefile
Name select. Only contigs or reads whose name appears in namefile are selected for output. namefile is a simple text file with one name entry per line.
-o offset
Offset of quality values in FASTQ files. Only valid if -f is FASTQ.
-R namestring
Rename contigs/singlets/reads with given name string to which a counter is added.
Known bug: will create duplicate names if the input (CAF or MAF) contains contigs/singlets as well as free reads, i.e. reads that are in neither contigs nor singlets.
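To illustrate what the offset given with -o means: in a FASTQ file each quality value is stored as a single character whose ASCII code is the quality value plus an offset (33 and 64 are the offsets commonly encountered). The following Python sketch is an illustration only and not code taken from convert_project; the function name decode_qualities is made up for this example.

```python
def decode_qualities(quality_line, offset=33):
    """Turn one FASTQ quality line into a list of integer quality values.

    Hypothetical helper for illustration; 'offset' plays the role of -o.
    """
    values = [ord(ch) - offset for ch in quality_line]
    if any(v < 0 for v in values):
        # A negative value means the offset does not fit this file.
        raise ValueError("offset is too large for this quality line")
    return values


# 'I' is ASCII 73, '5' is ASCII 53, '!' is ASCII 33.
print(decode_qualities("II5!", offset=33))  # -> [40, 40, 20, 0]
```

Choosing the wrong offset shifts every quality value by the same amount, which is why the offset has to match the file being read.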
The following switches work only if the input file contains contigs (i.e., CAF or MAF with contig data). Note that, although this is infrequent, both CAF and MAF files can contain single reads only.
-M
Do not extract contigs (or their consensus), but the sequence of the reads they are composed of.
-N namefile
Name select, sorted. Only contigs whose name appears in namefile are selected for output. Regardless of the order of contigs in the input, the output is sorted according to the order in which the names appear in namefile. namefile is a simple text file with one name entry per line.
Note that for this function to work, all contigs are loaded into memory, which may strain your RAM for larger projects.
-r { c | C | q | f }
Recalculate consensus and / or consensus quality values and / or SNP feature tags of an assembly. This feature is useful when third party programs create their own consensus sequences without handling different sequencing technologies (e.g. the combination of gap4 and caf2gap) or when the CAF/MAF files do not contain consensus sequences at all.
c
C
q
f
Note: Only the last of cCq is relevant; 'f' works as a switch and can be combined with the others (e.g. “-r Cf”).
Note: If the CAF/MAF contains reads from multiple strains, recalculation of consensus & consensus qualities is forced; you can only influence whether IUPACs are used or not. This is because CAF/MAF do not provide facilities to store consensus sequences from multiple strains.
-s
Split. Split output into single files, one file per contig. Files are named according to the name of the contig.
-u
fillUp strain genomes. In assemblies made of multiple strains, holes in the consensus of a strain (bases 'N' or '@') can be filled up with the consensus of the other strains. Takes effect only when '-r' is active.
-q quality_value
Defines the minimum quality a consensus base of a strain must have; consensus bases with a quality below this will be set to 'N'. Only used when -r is active.
-v coverage_value
Defines the minimum coverage a consensus base of a strain must have; consensus bases with a coverage below this will be set to 'N'. Only used when -r is active.
-x length
Minimum length a contig (in full assemblies) or read (in single sequence files) must have. All contigs / reads with a length less than this value are discarded. Default: 0 (=switched off).
Note: this is of course not applied to reads in contigs! Contigs passing the [-x] length criterion and stored as a complete assembly (CAF, MAF, ACE, etc.) still contain all their reads.
-X length
Similar to [-x], but applies only to clipped reads (input file format must have clipping points set to be effective).
-y contig_coverage
Minimum average contig coverage. Contigs with an average coverage less than this value are discarded.
-z min_reads
Minimum number of reads in contig. Contigs with fewer reads than this value are discarded.
-l line_length
On output of assemblies as text or HTML: number of bases shown in one alignment line. Default: 60.
-c endgap_character
On output of assemblies as text or HTML: character used to pad endgaps. Default: ' ' (a blank)
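As a rough illustration of how the contig filters -x, -y and -z combine, the following Python sketch keeps a contig only if it passes all three thresholds. It is not convert_project code: the ContigSummary record and the function passes_filters are hypothetical, and average coverage is assumed here to be the number of read bases divided by the contig length, which may differ from how MIRA computes it.

```python
from dataclasses import dataclass


@dataclass
class ContigSummary:
    """Hypothetical per-contig record used only for this illustration."""
    name: str
    length: int            # contig length in bases
    num_reads: int         # number of reads in the contig
    total_read_bases: int  # sum of the lengths of all reads in the contig

    @property
    def average_coverage(self):
        # Assumption: average coverage = read bases / contig length.
        return self.total_read_bases / self.length if self.length else 0.0


def passes_filters(contig, min_length=0, min_coverage=0.0, min_reads=0):
    """Mimic -x (length), -y (average coverage) and -z (number of reads)."""
    return (contig.length >= min_length
            and contig.average_coverage >= min_coverage
            and contig.num_reads >= min_reads)


contigs = [
    ContigSummary("c1", 2500, 120, 30000),  # coverage 12.0
    ContigSummary("c2", 1500, 80, 30000),   # too short for -x 2000
    ContigSummary("c3", 4000, 50, 20000),   # coverage 5.0, too low for -y 10
]
kept = [c.name for c in contigs if passes_filters(c, min_length=2000, min_coverage=10)]
print(kept)  # -> ['c1']
```

With all thresholds left at 0 every contig passes, which corresponds to the filters being switched off.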
In the following examples, the CAF and MAF files used are expected to contain full assembly data like the files created by MIRA during an assembly or by the gap2caf program. CAF and MAF could be used interchangeably in these examples, depending on which format currently is available. In general though, MAF is faster to process and smaller on disk.
convert_project -f caf -t fasta -t wig -t ace source.caf dest
convert_project -f maf -t caf -x 2000 -y 10 source.maf dest
convert_project -f fastq -x 55 -R newname source.fastq dest
This example will fetch the contigs named bchoc_c14, ...3, ...5 and ...13 and save the result in exactly that order to a new file:
arcadia:/path/to/myProject$ ls -l
-rw-r--r-- 1 bach users 231698898 2007-10-21 15:16 bchoc_out.caf
-rw-r--r-- 1 bach users 38 2007-10-21 15:16 contigs.lst
arcadia:/path/to/myProject$ cat contigs.lst
bchoc_c14
bchoc_c3
bchoc_c5
bchoc_c13
arcadia:/path/to/myProject$ convert_project -f caf -N contigs.lst bchoc_out.caf myfilteredresult
[...]
arcadia:/path/to/myProject$ ls -l
-rw-r--r-- 1 bach users 231698898 2007-10-21 15:16 bchoc_out.caf
-rw-r--r-- 1 bach users 38 2007-10-21 15:16 contigs.lst
-rw-r--r-- 1 bach users 828726 2007-10-21 15:24 myfilteredresult.caf
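The ordering behaviour of -N in the example above can be pictured with a short Python sketch. It is a simplified model, not the actual implementation: the functions read_namefile and select_sorted are hypothetical, and contigs are represented as plain (name, data) pairs. Holding all contigs in a dictionary mirrors the remark that everything has to be loaded into memory first.

```python
def read_namefile(text):
    """Return the names of a name file: one entry per line, blanks skipped."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def select_sorted(contigs, names):
    """Return the selected contigs in the order given by 'names'.

    'contigs' is an iterable of (name, data) pairs in input order.
    """
    by_name = dict(contigs)  # all contigs are held in memory at once
    return [(name, by_name[name]) for name in names if name in by_name]


# Contigs as they might appear in the input file (placeholder data).
input_contigs = [
    ("bchoc_c3", "..."), ("bchoc_c5", "..."), ("bchoc_c7", "..."),
    ("bchoc_c13", "..."), ("bchoc_c14", "..."),
]
names = read_namefile("bchoc_c14\nbchoc_c3\nbchoc_c5\nbchoc_c13\n")
print([name for name, _ in select_sorted(input_contigs, names)])
# -> ['bchoc_c14', 'bchoc_c3', 'bchoc_c5', 'bchoc_c13']
```

In this sketch, names that do not occur in the input are silently skipped; whether convert_project does the same is not stated here.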
mirabait [options] {bait_file} {input_file} {output_basename}
While input and output file can have any of the supported formats (see -f and -t options), the bait file needs to be in FASTA format.
mirabait selects reads from a read collection which are partly similar or equal to sequences defined as target baits. Similarity is defined by finding a user-adjustable number of common k-mers (sequences of k consecutive bases) which are the same in the bait sequences and the screened sequences to be selected, either in forward or reverse complement direction.
The search performed is exact, that is, sequences selected are guaranteed to have the required number of k-mers equal to the bait sequences, while sequences not selected are guaranteed not to have them.
-f { caf | maf | fasta | fastq | gbf | phd }
“From-type”, the format of the input file. Default: fastq.
-t { caf | maf | fasta | fastq }
“To-type”, the format of the output file. Default: format of the input.
Multiple mentions of -t are allowed, in which case the selected sequences are written to all file formats chosen.
-k k-mer-length
k-mer, length of bait in bases (≤32, default=31)
-n minoccurence
Minimum number of k-mers needed for a sequence to be selected. Default: 1.
-i
Inverse hit: selects only sequences that do not meet the -k and -n criteria.
-r
Does not check for hits in reverse complement direction.
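The selection principle of mirabait and the effect of -k, -n, -i and -r can be sketched in a few lines of Python. This is a conceptual model only and certainly not how mirabait is implemented: all function names are hypothetical, sequences are plain strings, and hits are counted per k-mer position in the read, which is an assumption about how the minimum number of k-mers is counted.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of all k-mers (k consecutive bases) of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_bait_index(baits, k, both_strands=True):
    """Collect the k-mers of all baits, optionally also of their reverse complements."""
    index = set()
    for bait in baits:
        index |= kmers(bait, k)
        if both_strands:
            index |= kmers(reverse_complement(bait), k)
    return index


def count_hits(read, index, k):
    """Count the k-mer positions of a read that match a bait k-mer exactly."""
    read = read.upper()
    return sum(1 for i in range(len(read) - k + 1) if read[i:i + k] in index)


def select_reads(reads, baits, k=31, min_occurrence=1, inverse=False, both_strands=True):
    """Model of -k (k), -n (min_occurrence), -i (inverse) and -r (both_strands=False)."""
    index = build_bait_index(baits, k, both_strands)
    selected = []
    for name, seq in reads:
        is_hit = count_hits(seq, index, k) >= min_occurrence
        if is_hit != inverse:  # with inverse=True only the non-hits are kept
            selected.append(name)
    return selected


bait = "ACCACAACACCCA"
reads = [
    ("fwd", "GGACCACAACAGG"),                  # shares k-mers with the bait
    ("rev", reverse_complement("ACCACAACA")),  # matches only on the other strand
    ("none", "GGGGGGGGGGGG"),                  # no k-mer in common
]
print(select_reads(reads, [bait], k=5))                      # -> ['fwd', 'rev']
print(select_reads(reads, [bait], k=5, both_strands=False))  # -> ['fwd']
print(select_reads(reads, [bait], k=5, inverse=True))        # -> ['none']
```

Because membership in the k-mer set is tested exactly, a read is either selected or not with no approximation involved, which matches the guarantee described for the search.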