Slice Tools
Home
SourceForge Page
getCoverage
Home
Usage Summary
Options Help
Output
File Formats
Examples
FAQs
Support Information
|
|
getCoverage
Michael Schatz
Initial Design by Pawel Gajer
Display coverage data from .contig and .qual files in a variety of modes at
specified positions. Output modes include full coverage data, summary data, and
Slice XML representations. Operations can be performed per contig id or for
all contigs in a .contig file, and over a specifed range of positions.
Usage Summary
Usage: getCoverage [mode] [coordinates] [options] <prefix>
<prefix> Prefix to .contig and .qual files
-a|--asmbl_id <asmbl_id> Specify single contig id in multi-contig file
(use 'all' for all contigs in file)
Display Modes
-------------
-s|--stats Calculate a summary of the coverage
-t|--tiling Calculate full coverage data
-l|--xml Display tiling in Slice XML format
-v|--vertical Display tiling in a greedy vertical representation
-H|--horizontal Display tiling in a simple horizontal representation
[none] Display a compact view of the coverage
Coordinate Options
------------------
For all modes except the compact view, the default is to calculate coverage
information for every position in the contig. For all modes, a range of
positions (a cut) can be specified.
-x {--xg} <pos> Specify ungapped {gapped} position for start of cut
Negative values cut from position to end of contig
-y {--yg} <pos> Specify ungapped {gapped} position for end of cut
Negative values specify offset from -x <pos> or end of contig
-z|--radius <size> Specify a circular cut, use in conjunction with -x {--xg}
-r <seqname> Use the coordinates of the read to specifiy the range
Other Options
-------------
-0 Use 0-based coordinates when displaying data (Default for XML)
-1 Use 1-based coordiantes when displaying data (Default for other modes)
--nogaps Display ungapped data only
--offset Generate read offset file for tiling mode
-p|--gapqv Assign gaps the minimum of the flanking qv's in XML mode
-M|--missing Allow missing quality values for reads referenced in contig
-G|--gapped Display gapped coordinates (for tcov and horizontal modes)
-u|--unstable Allow the read ids to vary with cut (implies -t)
-Q|--silent Do not display status messages
-R|--readid Show the Read index of each read in the idTbl in tiling mode
-S|--split Split output into separate files per contig (implies '-a all')
-i|--index Build or use an index on the quality values
-o|--outdir <path> Specify output directory
-c <mode> Recall the consensus in specified ambiguity mode for stats file
(1=No ambiguity, 2=Minimal Ambiguity, 3=Annotation, 4=Conic)
-C|--circular Assume reads that extend beyond the consensus are circular
This is *NOT* necessary for reads that are circular with
negative offsets.
Back to top
Options Help
Details for each option.
-h: |
Displays help on the options
Output is to stderr
|
-t:
--tiling:
|
Calculate tiling and write to file.
This mode is useful for displaying the bases and quality values that tile given consensus positions, and can easily be used within scripts.
|
-s:
--stats:
|
Calculate basic slice statistics and write to file.
This is mode is useful if you are mostly interested in the consensus or quality classes along the consensus.
|
-l:
--xml:
|
Print Slice XML of contig or a cut
For more information on slices and the associated Slice Tools please
visit the Main Page
|
-H:
--horizontal:
|
Print tiling information in a horizontal fashion similiar to cutAsm
A horizontal representation is a simple representation in which the consensus is displayed and the sequences which tile the consensus are shown beneath with appropriate alignment characters.
This is the only mode in which a given sequence is contiguous along a given row and was based on the output of cutAsm.
|
-v:
--vertical:
|
Print tiling information in a greedy vertical representation
The vertical representaion writes sequences vertically such that a sequence will be contiguous within a given column. It is called greedy because each sequence is placed in the first available column.
|
-x <asmPos>:
--ugap <asmPos>:
|
Sets ungapped assembly position for cut
Output is to stdout.
|
--xg <gAsmPos>:
--gap <gAsmPos>:
|
Sets gapped assembly position for cut
Output is to stdout.
|
-y <asmPos>:
--yugap <asmPos>:
|
Sets ungapped assembly position for cut end
Negative numbers specify a negative offset from ungapped assembly position,
or end of assembly if -x not specified.
|
--yg <gAsmPos>:
--ygap <gAsmPos>:
|
Sets gapped assembly position for cut end
Negative numbers specify a negative offset from gapped assembly position,
or end of assembly if --xg not specified.
|
-z <radius>:
--radius <radius>:
|
Sets radius for cut
Select a cut of a tiling <radius> positions to the left and right
of <asmPos> or <gAsmPos> (depending on whichever is specified).
Note: |
The default value of <radius> is 5, so by default a total of 11 bases centered on <asmPos> or <gAsmPos> will be displayed.
|
|
-r <seqname>:
|
Specify the cut coordinates based on the coordinates of a read within the assembly
The cut will specify exactly the range of the consensus which that read tiles.
|
-0: |
Display 0-based coordiantes (XML Default)
Display coordinates starting from 0.
Note: |
Coordinates specified on the command line are always 1-based.
|
|
-1: |
Display 1-based coordiantes (Default)
Display coordinates starting from 1.
Note: |
Coordinates specified on the command line are always 1-based.
|
|
-a <asmbl_id>:
--asmbl_id <asmbl_id>:
|
Sets assembly id to process
If prefix.contig contains more than one contig, process only contig
<asmbl_id>.
Note: |
Set <asmbl_id> to 'all' to process all contigs in the prefix.contig file.
If -a is not specified, then only the first contig in the contig file will be processed.
|
|
-b:
--basescoverage:
|
Calculate base coverage for stats
In addition to the depth of coverage, display the base coverage for each slice
in the contig in stats mode. Displayed before --consensus.
|
-c <mode>:
--consensus <mode>: |
Calculate a new consensus for stats
Use the slice information to calculate a new consensus for each slice in the
contig. Displayed last in the stats output.
Modes are:
0: Don't recompute the consensus
1: Recompute without ambiguity codes
2: Recompute with the Churchill and Waterman model
3: Recompute with the annotation ambiguity model
4: Recompute with the conic ambiguity model
See http://intranet/software_docs/libSlice/
for more information on the various modes of recalling the consensus.
|
-G:
--gapped:
|
Display gapped coordinates (for tcov and horizontal modes)
In tcov mode this will show the gapped coordinates in addition to the ungapped coordinates. In horizontal mode, it will show gapped coordinates instead of ungapped coordinates.
|
--nogaps:
|
Don't display slices with gap as the consensus
Slices with gaps as the consensus are displayed otherwise.
|
-i:
--index:
|
Build or use an index on the quality values
If an index is found for "prefix.qual" named "prefix.qual.idx", then getCoverage will use the index. Otherwise, it will create the index.
The use of an index can greatly speed up getCoverage operations.
|
-Q:
--silent:
|
Don't display status messages
Disables displaying of status messages.
|
-C:
--circular:
|
Assume reads that extend beyond the consensus are circular
This option is only necessary for reads that span the origin that extend
beyond the right edge of the contig. If this option is not specified, that it
will be assumed that the contig has been corrupted, and the sequence will be
truncated at the edge of the consensus.
Note: |
This option is *NOT* necessary for reads that are circular with negative offsets. Reads with negative offset will always be wrapped around the origin.
|
|
-R:
--readid:
|
Show read id for each read in idTbl
Shows the read id (line number) for each read in the idTbl in tcov mode.
|
-S:
--split:
|
Split results by contig id (Implies -a all)
Split the results into separate files for each contig id, especially for when
operating on multi-assembly contig files.
|
-u:
--unstable:
|
Allow for unstable read ids in tiling mode
Unstable read ids means the read ids will vary depending on the coordinates
of the cut, ie read 0 may reference different reads at different locations
of the contig. Using unstable readids allows for the contig to be processed
much faster in tcov mode.
|
--offset:
|
Generate read offset file for tiling
Print offset file prefix.tcov_offset whose i-th row contains the offset
of the i-th row of the prefix.tcov file. prefix.tcov_offset can be generated
only if the tcov file contains tiling data of a single contig.
|
-M:
--missing:
|
Allow missing quality values
Allows for quality values to be missing for some or all reads referenced
in a contig file. By default all bases are then assigned a quality value of 0.
|
-p:
--gapqv:
|
Assign quality values to gaps
Assigns gaps a quality value of the minimum of the quality values of the
flanking bases. This is useful for recalling the consensus exactly as it is
displayed in cloe.
|
-o:
--outdir:
|
Specify the directory for output
If not specified, use the same directory as the input for stats, vertical, and tcov.
Simple coverage, XML, and horizontal default to stdout unless -o is specified. If '-o -' is
specified, display output to stdout for all display types.
|
Back to top
Output
For all examples, assume dir/file.contig has two contigs listed: '##1' and '##2'.
For modes that default to stdout (-l, -H, compact):
|
% getCoverage -l dir/file.contig
Outputs to stdout slice of contig 1 to stdout.
|
% getCoverge -l dir/file.contig -a 2
Outputs to stdout slice of contig 2 to stdout.
|
% getCoverge -l dir/file.contig -o out
Creates out.slice with slice of contig 1
|
% getCoverge -l dir/file.contig -o out -a 2
Creates out.slice with slice of contig 2
|
% getCoverge -l dir/file.contig -o out -a all
Errors suggesting to use split mode
|
% getCoverage -l dir/file.contig -S
Creates dir/1.slice
Creates dir/2.slice
|
% getCoverage -l dir/file.contig -S -o other
Creates other/1.slice
Creates other/2.slice
|
|
For modes that default to file(s) (-t, -s, -v):
|
% getCoverage -t dir/file.contig
Creates dir/1.tcov (with just contig 1)
Creates dir/1.idTbl (with just contig 1)
|
% getCoverge -t dir/file.contig -a 2
Creates dir/2.tcov (with just contig 2)
Creates dir/2.idTbl (with just contig 2)
|
% getCoverge -t dir/file.contig -o out
Creates out.tcov (with just contig 1)
Creates out.idTbl (with just contig 1)
|
% getCoverge -t dir/file.contig -o out -a 2
Creates out.tcov (with just contig 2)
Creates out.idTbl (with just contig 2)
|
% getCoverge -t dir/file.contig -a all
Creates dir/file.tcov (with contigs 1&2)
Creates dir/file.idTbl (with contigs 1&2)
|
% getCoverge -t dir/file.contig -o out -a all
Creates out.tcov (with contigs 1&2)
Creates out.idTbl (with contigs 1&2)
|
% getCoverage -t dir/file.contig -S
Creates dir/1.tcov (with contig 1)
Creates dir/1.idTbl (with contig 1)
Creates dir/2.tcov (with contig 2)
Creates dir/2.idTbl (with contig 2)
|
% getCoverage -t dir/file.contig -S -o other
Creates other/1.tcov (with contig 1)
Creates other/1.idTbl (with contig 1)
Creates other/2.tcov (with contig 2)
Creates other/2.idTbl (with contig 2)
|
|
Warning: |
getCoverage will overwrite any existing files when generating output, i.e. if you rerun getCoverage in stats mode then the original stats file will be overwritten.
|
Back to top
File Formats
asmbl_id.tcov contains a coverage tiling information in the format:
<ungapped 1-based consensus position> <consensus qv> <base call coverage> <qv coverage> <read indices>
prefix.tcov contains the same information for each contig of the
prefix.contig file (if it has more than one contig), but the first column of
prefix.tcov is <asmbl_id> Thus the format of
the prefix.tcov file is:
<asmbl_id> <ungapped 1-based consensus position> <consensus qv> <base call coverage> <qv coverage> <read indices>
To simplify a display of coverage information each read is
assigned a positive integer value, called read index, that identifies
this read.
For example, a fragment of an asmbl_id.tcov file may look like this:
1091 C 221 CCCACCC 32:34:33:15:35:34:33 3:8:9:10:11:12:13
1092 G 229 GGGGGGG 32:34:33:34:35:36:25 3:8:9:10:11:12:13
1093 G 224 GGGGGGG 33:34:33:34:34:36:20 3:8:9:10:11:12:13
1094 C 198 CCCCCC 35:33:32:29:35:34 8:9:10:11:12:13
1095 A 200 AAAAAA 34:33:35:32:34:32 8:9:10:11:12:13
1096 G 183 GGGGGG 35:33:23:32:35:25 8:9:10:11:12:13
1097 G 175 GGGGGG 28:34:29:35:35:14 8:9:10:11:12:13
Lets examine the first row of this fragment.
1091 |
The ungapped position in the given contig.
All of the coverage information in this row pertains to this position.
|
C |
The consensus sequence value at the position 1091.
|
221 |
The consensus quality value, computed using Churchill-Waterman algorithm.
|
CCCACCC |
The bases covering the given position.
|
32:34:33:15:35:34:33 |
The quality values of the bases.
|
3:8:9:10:11:12:13 |
Indices of the corresponding bases and quality values.
That is the first base 'C' belongs to a read with read index 3, the second base belongs to a read with read index 8, etc. To decifer read ids from their indices we have to use an idTbl file.
prefix.idTbl or asmbl_id.idTbl file consists of two columns:
<seq_id> <rc flag>
The i-th row of the idTbl file has the seq_d and rc flag information for
the read with read index (i-1).
|
asmbl_id.tcovStats contains a coverage tiling information in the format:
1 <ungapped consensus position>
2 <gapped consensus position>
3 <consensus base>
4 <depth of coverage>
5 <consensus quality value (computed using Churchill-Waterman algorithm)>
6 <consensus quality class>
prefix.tcovStats contains the same information as asmbl_id.tcovStats for each contig of the
prefix.contig file (if it has more than one contig), and therefore we set the
first column of prefix.tcovStats to be <asmbl_id>
asmbl_id.vertical contains a vertical representation of the assembly where a given column is used for an entire sequence, and a two row blank is used to differentiate sequences.
Two blank lines in a row differentiate contigs.
For example:
A
C
C
GG
TA
T
G
CC
AA
This shows a very small contig consisting of three sequences.
The first has bases ACCGT, the second has GATGCA and the third is CA.
This is the same format that the realigner program requires for input.
This mode can be used to create inputs for the realigner program.
Back to top
Examples
In the examples that follow gbr is a prefix of gbr.contig and gbr.qual
files of the Brucella suis genome (GBR database), which contain assembly
information for multiple contigs. 10683.contig is a contig from DMG with
an associated 10683.qual file that lists just a single contig.
-
To print a fragment of the tiling data, five positions to the left
and five positions to the right from the ungapped assembly position 415 in
the assembly of the 10683.contig file execute:
getCoverage 10683 -x 415
The output is:
asmbl_id: 10683
Ungapped Consensus 0123 0 1 2 3
410 T TT.. (35 20 . .)
411 T TT.. (36 17 . .)
412 G GG.. (36 17 . .)
413 T TTT. (36 22 34 .)
414 G GGG. (36 21 30 .)
415 T TTT. (36 21 33 .)
416 T TTTT (34 26 26 25)
417 G GGGG (36 31 33 21)
418 T TTTT (36 36 36 19)
419 G GGGG (36 26 32 15)
420 T TTTT (36 33 36 15)
Sequence IDs RC Range
0: DMGLN85TF 0 <1 484>
1: DMGHB30TR 1 <384 523>
2: DMGIC39TF 1 <413 995>
3: DMGRA39TR 1 <416 620>
The first column lists ungapped assembly positions, the second column lists
consensus base values, and the third column displays base call coverage
data. 0123 at the top of the third column lists indices of the first, second,
third, and fourth respectively read that covers the corresponding fragment of
the assembly. seq_id corresponding to these indices are listed under the
'Sequence IDs RC' heading. If the sequence does not tile a consensus position,
then the base call and the quality value for that sequence at that position is '.'.
At the bottom, there is a summary of each read, displaying the seqname, rc flag
and assembly range from the contig file. The assembly range is the 1-based
ungapped coordinates of the range of the consensus that the read tiles.
Note that we didn't specify the value of -z parameter, because it is set
to 5 by default. If we wanted to display a tiling 10 bases to the left and right
from the position 415 we would have to execute:
getCoverage 10683 -x 415 -z 10
-
To display information of a small range near 123 in Slice XML format.
getCoverage 10683.contig -x 123 -z 1 --xml
The output is:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Request SYSTEM "http://slicetools.sourceforge.net/aserver.dtd">
<Request Type="Consensus" Host="getCoverage" User="getCoverage" Option="All">
<ReadCollection>
<Read Id="DMGLN85TF" Seqname="DMGLN85TF" Dir="0" Chem=""/>
</ReadCollection>
<SliceRange Index="121" Gindex="">
<Slice Index="0" Gindex="1" Existing="A">
<Nuc>A</Nuc>
<Qualval>27</Qualval>
<ReadID>DMGLN85TF</ReadID>
</Slice>
<Slice Index="1" Gindex="1" Existing="A">
<Nuc>A</Nuc>
<Qualval>34</Qualval>
<ReadID>DMGLN85TF</ReadID>
</Slice>
<Slice Index="2" Gindex="1" Existing="C">
<Nuc>C</Nuc>
<Qualval>32</Qualval>
<ReadID>DMGLN85TF</ReadID>
</Slice>
</SliceRange>
</Request>
By default, the output would have been the entire contig.
-
To display the tiling data in horizontal format
getCoverage 10683.contig -x 415 -H
The output is:
Assembly 10683: [410U, 420U] (1-based)
44444444444
11111111112
Position: 01234567890
Consensus: TTGTGTTGTGT
Covering Reads:
DMGLN85TF TTGTGTTGTGT
DMGHB30TR TTGTGTTGTGT
DMGIC39TF ...TGTTGTGT
DMGRA39TR ......TGTGT
This shows that read DMGIC39TF begins it tiling at ungapped 1-based position 413, and
DMGRA39TR begins at 416. You can also specify -G to display positions in 1-based gapped format.
0-based positions can be displayed by specifying -0 but the command line parameters (-x, -y) are always 1-based.
-
To print tcov, and idTbl files for just assembly 70265:
getCoverage gbr -t -a 70265
The output is placed in the files:
70265.tcov
70265.idTbl
-
To print only ungapped contig positions to the tcov file of the contig 70265,
execute:
getCoverage gbr -t --nogaps -a 70265
-
If asmbl_id 70265 had a lot of reads and the tcov file is very large,
you can generate a file 70265.tcov_offset whose i-th row contains the offset
of the i-th row of the 70265.tcov file. This can be used to quickly access
different fragments of the tcov file. To do this execute:
getCoverage gbr -t --nogaps --offset -a 70265
-
To calculate tiling information (with contig gaps included) for all contigs in
one tcov file, execute
getCoverage gbr -t -a all
Note that tcov file will be quite big for a large contig file.
In the above example:
SIZE FILE
-----------------
31M gbr.contig
257M gbr.tcov
564k gbr.idTbl
Back to top
FAQs
Q: |
When should I use an index (prefix.qual.idx)? |
A: |
The index allows for quality values to be loaded without first parsing the entire .qual file.
This can greatly speed up getCoverage when dealing with large assemblies.
Typically you would use the index when you will be making several calls to getCoverage (possibly with different ranges or modes) but with the same qual file.
There is no advantage to using the index if you will only be running getCoverage once with a specific qual file.
|
Q: |
What does getCoverage.warnings mean? |
A: |
getCoverage will create a file called getCoverage.warnings for errors it finds in assemblies that are not fatal.
The includes errors relating to the number of sequences misreported, the length of a sequence misreported, or sequences with invalid clear ranges (the quality values will be replaced with 0's).
In these cases, the warning usually indicates that there is a (possibly severe) problem with the data, but getCoverage will do its best effort to display the tiling as asked for.
In some circumstances, getCoverage will detect a fatal error with the contig and create a getCoverage.error file. This will list fatal error conditions including: not being able to open required files, missing quality values for sequences (when not run with -M), the requested asmbl_id not found in the contig file, the length of the consensus sequence misreported, the qual index has been corrupted, or other internal errors.
If getCoverage still causes problems after double checking your comand line parameters or removing the invalid index, please file a ticket with the Closure Software team.
|
Back to top
Support Information
Please visit the SliceTools SourceForge Page
for more information, or send email to slicetools-help [a t] lists . sourceforge . net.
Back to top
Last Updated: $Date: 2005/07/29 20:13:12 $
|