Slice Tools
  • Home
  • SourceForge Page


  • getCoverage
  • Home
  • Usage Summary
  • Options Help
  • Output
  • File Formats
  • Examples
  • FAQs
  • Support Information


  • SourceForge.net Logo
      getCoverage

    Michael Schatz
    Initial Design by Pawel Gajer

    Display coverage data from .contig and .qual files in a variety of modes at specified positions. Output modes include full coverage data, summary data, and Slice XML representations. Operations can be performed per contig id or for all contigs in a .contig file, and over a specifed range of positions.

    Usage Summary

      Usage: getCoverage [mode] [coordinates] [options] <prefix>
    
        <prefix> Prefix to .contig and .qual files
    
        -a|--asmbl_id <asmbl_id> Specify single contig id in multi-contig file
                                 (use 'all' for all contigs in file)
    
      Display Modes
      -------------
      -s|--stats      Calculate a summary of the coverage 
      -t|--tiling     Calculate full coverage data
      -l|--xml        Display tiling in Slice XML format 
      -v|--vertical   Display tiling in a greedy vertical representation
      -H|--horizontal Display tiling in a simple horizontal representation
      [none]          Display a compact view of the coverage 
    
      Coordinate Options
      ------------------
      For all modes except the compact view, the default is to calculate coverage 
      information for every position in the contig. For all modes, a range of
      positions (a cut) can be specified.
      
      -x {--xg} <pos> Specify ungapped {gapped} position for start of cut
                      Negative values cut from position to end of contig
    
      -y {--yg} <pos> Specify ungapped {gapped} position for end of cut
                      Negative values specify offset from -x <pos> or end of contig
    
      -z|--radius <size> Specify a circular cut, use in conjunction with -x {--xg}
      -r <seqname>  Use the coordinates of the read to specifiy the range
    
      Other Options
      -------------
      -0 Use 0-based coordinates when displaying data (Default for XML)
      -1 Use 1-based coordiantes when displaying data (Default for other modes)
    
      --nogaps      Display ungapped data only
      --offset      Generate read offset file for tiling mode
      -p|--gapqv    Assign gaps the minimum of the flanking qv's in XML mode
      -M|--missing  Allow missing quality values for reads referenced in contig
      -G|--gapped   Display gapped coordinates (for tcov and horizontal modes)
      -u|--unstable Allow the read ids to vary with cut (implies -t)
      -Q|--silent   Do not display status messages
      -R|--readid   Show the Read index of each read in the idTbl in tiling mode
      -S|--split    Split output into separate files per contig (implies '-a all')
      -i|--index    Build or use an index on the quality values
      -o|--outdir   <path> Specify output directory
      -c <mode>     Recall the consensus in specified ambiguity mode for stats file
                    (1=No ambiguity, 2=Minimal Ambiguity, 3=Annotation, 4=Conic)
      -C|--circular Assume reads that extend beyond the consensus are circular
                    This is *NOT* necessary for reads that are circular with 
                    negative offsets.
    
    
    Back to top

    Options Help

    Details for each option.
    -h: Displays help on the options
    Output is to stderr
    -t:
    --tiling:
    Calculate tiling and write to file.
    This mode is useful for displaying the bases and quality values that tile given consensus positions, and can easily be used within scripts.
    -s:
    --stats:
    Calculate basic slice statistics and write to file.
    This is mode is useful if you are mostly interested in the consensus or quality classes along the consensus.
    -l:
    --xml:
    Print Slice XML of contig or a cut
    For more information on slices and the associated Slice Tools please visit the Main Page
    -H:
    --horizontal:
    Print tiling information in a horizontal fashion similiar to cutAsm
    A horizontal representation is a simple representation in which the consensus is displayed and the sequences which tile the consensus are shown beneath with appropriate alignment characters. This is the only mode in which a given sequence is contiguous along a given row and was based on the output of cutAsm.
    -v:
    --vertical:
    Print tiling information in a greedy vertical representation
    The vertical representaion writes sequences vertically such that a sequence will be contiguous within a given column. It is called greedy because each sequence is placed in the first available column.
    -x <asmPos>:
    --ugap <asmPos>:
    Sets ungapped assembly position for cut
    Output is to stdout.
    --xg <gAsmPos>:
    --gap <gAsmPos>:
    Sets gapped assembly position for cut
    Output is to stdout.
    -y <asmPos>:
    --yugap <asmPos>:
    Sets ungapped assembly position for cut end
    Negative numbers specify a negative offset from ungapped assembly position, or end of assembly if -x not specified.
    --yg <gAsmPos>:
    --ygap <gAsmPos>:
    Sets gapped assembly position for cut end
    Negative numbers specify a negative offset from gapped assembly position, or end of assembly if --xg not specified.
    -z <radius>:
    --radius <radius>:
    Sets radius for cut
    Select a cut of a tiling <radius> positions to the left and right of <asmPos> or <gAsmPos> (depending on whichever is specified).

    Note: The default value of <radius> is 5, so by default a total of 11 bases centered on <asmPos> or <gAsmPos> will be displayed.
    -r <seqname>: Specify the cut coordinates based on the coordinates of a read within the assembly
    The cut will specify exactly the range of the consensus which that read tiles.
    -0: Display 0-based coordiantes (XML Default)
    Display coordinates starting from 0.

    Note: Coordinates specified on the command line are always 1-based.
    -1: Display 1-based coordiantes (Default)
    Display coordinates starting from 1.

    Note: Coordinates specified on the command line are always 1-based.
    -a <asmbl_id>:
    --asmbl_id <asmbl_id>:
    Sets assembly id to process
    If prefix.contig contains more than one contig, process only contig <asmbl_id>.

    Note: Set <asmbl_id> to 'all' to process all contigs in the prefix.contig file.
    If -a is not specified, then only the first contig in the contig file will be processed.
    -b:
    --basescoverage:
    Calculate base coverage for stats
    In addition to the depth of coverage, display the base coverage for each slice in the contig in stats mode. Displayed before --consensus.
    -c <mode>:
    --consensus <mode>:
    Calculate a new consensus for stats
    Use the slice information to calculate a new consensus for each slice in the contig. Displayed last in the stats output.

    Modes are:
    0: Don't recompute the consensus
    1: Recompute without ambiguity codes
    2: Recompute with the Churchill and Waterman model
    3: Recompute with the annotation ambiguity model
    4: Recompute with the conic ambiguity model

    See http://intranet/software_docs/libSlice/ for more information on the various modes of recalling the consensus.
    -G:
    --gapped:
    Display gapped coordinates (for tcov and horizontal modes)
    In tcov mode this will show the gapped coordinates in addition to the ungapped coordinates. In horizontal mode, it will show gapped coordinates instead of ungapped coordinates.
    --nogaps: Don't display slices with gap as the consensus
    Slices with gaps as the consensus are displayed otherwise.
    -i:
    --index:
    Build or use an index on the quality values
    If an index is found for "prefix.qual" named "prefix.qual.idx", then getCoverage will use the index. Otherwise, it will create the index. The use of an index can greatly speed up getCoverage operations.
    -Q:
    --silent:
    Don't display status messages
    Disables displaying of status messages.
    -C:
    --circular:
    Assume reads that extend beyond the consensus are circular
    This option is only necessary for reads that span the origin that extend beyond the right edge of the contig. If this option is not specified, that it will be assumed that the contig has been corrupted, and the sequence will be truncated at the edge of the consensus.
    Note: This option is *NOT* necessary for reads that are circular with negative offsets. Reads with negative offset will always be wrapped around the origin.
    -R:
    --readid:
    Show read id for each read in idTbl
    Shows the read id (line number) for each read in the idTbl in tcov mode.
    -S:
    --split:
    Split results by contig id (Implies -a all)
    Split the results into separate files for each contig id, especially for when operating on multi-assembly contig files.
    -u:
    --unstable:
    Allow for unstable read ids in tiling mode
    Unstable read ids means the read ids will vary depending on the coordinates of the cut, ie read 0 may reference different reads at different locations of the contig. Using unstable readids allows for the contig to be processed much faster in tcov mode.
    --offset: Generate read offset file for tiling
    Print offset file prefix.tcov_offset whose i-th row contains the offset of the i-th row of the prefix.tcov file. prefix.tcov_offset can be generated only if the tcov file contains tiling data of a single contig.
    -M:
    --missing:
    Allow missing quality values
    Allows for quality values to be missing for some or all reads referenced in a contig file. By default all bases are then assigned a quality value of 0.
    -p:
    --gapqv:
    Assign quality values to gaps
    Assigns gaps a quality value of the minimum of the quality values of the flanking bases. This is useful for recalling the consensus exactly as it is displayed in cloe.
    -o:
    --outdir:
    Specify the directory for output
    If not specified, use the same directory as the input for stats, vertical, and tcov. Simple coverage, XML, and horizontal default to stdout unless -o is specified. If '-o -' is specified, display output to stdout for all display types.

    Back to top

    Output

    For all examples, assume dir/file.contig has two contigs listed: '##1' and '##2'.
    For modes that default to stdout (-l, -H, compact):
    % getCoverage -l dir/file.contig
    Outputs to stdout slice of contig 1 to stdout.
    % getCoverge -l dir/file.contig -a 2
    Outputs to stdout slice of contig 2 to stdout.
    % getCoverge -l dir/file.contig -o out
    Creates out.slice with slice of contig 1
    % getCoverge -l dir/file.contig -o out -a 2
    Creates out.slice with slice of contig 2
    % getCoverge -l dir/file.contig -o out -a all
    Errors suggesting to use split mode
    % getCoverage -l dir/file.contig -S
    Creates dir/1.slice
    Creates dir/2.slice
    % getCoverage -l dir/file.contig -S -o other
    Creates other/1.slice
    Creates other/2.slice
    For modes that default to file(s) (-t, -s, -v):
    % getCoverage -t dir/file.contig
    Creates dir/1.tcov (with just contig 1)
    Creates dir/1.idTbl (with just contig 1)
    % getCoverge -t dir/file.contig -a 2
    Creates dir/2.tcov (with just contig 2)
    Creates dir/2.idTbl (with just contig 2)
    % getCoverge -t dir/file.contig -o out
    Creates out.tcov (with just contig 1)
    Creates out.idTbl (with just contig 1)
    % getCoverge -t dir/file.contig -o out -a 2
    Creates out.tcov (with just contig 2)
    Creates out.idTbl (with just contig 2)
    % getCoverge -t dir/file.contig -a all
    Creates dir/file.tcov (with contigs 1&2)
    Creates dir/file.idTbl (with contigs 1&2)
    % getCoverge -t dir/file.contig -o out -a all
    Creates out.tcov (with contigs 1&2)
    Creates out.idTbl (with contigs 1&2)
    % getCoverage -t dir/file.contig -S
    Creates dir/1.tcov (with contig 1)
    Creates dir/1.idTbl (with contig 1)
    Creates dir/2.tcov (with contig 2)
    Creates dir/2.idTbl (with contig 2)
    % getCoverage -t dir/file.contig -S -o other
    Creates other/1.tcov (with contig 1)
    Creates other/1.idTbl (with contig 1)
    Creates other/2.tcov (with contig 2)
    Creates other/2.idTbl (with contig 2)


    Warning: getCoverage will overwrite any existing files when generating output, i.e. if you rerun getCoverage in stats mode then the original stats file will be overwritten.
    Back to top

    File Formats

    asmbl_id.tcov contains a coverage tiling information in the format:
        <ungapped 1-based consensus position> <consensus qv> <base call coverage> <qv coverage> <read indices>
    
    prefix.tcov contains the same information for each contig of the prefix.contig file (if it has more than one contig), but the first column of prefix.tcov is <asmbl_id> Thus the format of the prefix.tcov file is:
        <asmbl_id> <ungapped 1-based consensus position> <consensus qv> <base call coverage> <qv coverage> <read indices>
    
    To simplify a display of coverage information each read is assigned a positive integer value, called read index, that identifies this read.

    For example, a fragment of an asmbl_id.tcov file may look like this:
        1091 C 221 CCCACCC 32:34:33:15:35:34:33 3:8:9:10:11:12:13
        1092 G 229 GGGGGGG 32:34:33:34:35:36:25 3:8:9:10:11:12:13
        1093 G 224 GGGGGGG 33:34:33:34:34:36:20 3:8:9:10:11:12:13
        1094 C 198 CCCCCC 35:33:32:29:35:34 8:9:10:11:12:13
        1095 A 200 AAAAAA 34:33:35:32:34:32 8:9:10:11:12:13
        1096 G 183 GGGGGG 35:33:23:32:35:25 8:9:10:11:12:13
        1097 G 175 GGGGGG 28:34:29:35:35:14 8:9:10:11:12:13
    
    Lets examine the first row of this fragment.
    1091 The ungapped position in the given contig.
    All of the coverage information in this row pertains to this position.
    C The consensus sequence value at the position 1091.
    221 The consensus quality value, computed using Churchill-Waterman algorithm.
    CCCACCC The bases covering the given position.
    32:34:33:15:35:34:33 The quality values of the bases.
    3:8:9:10:11:12:13 Indices of the corresponding bases and quality values.
    That is the first base 'C' belongs to a read with read index 3, the second base belongs to a read with read index 8, etc. To decifer read ids from their indices we have to use an idTbl file.

    prefix.idTbl or asmbl_id.idTbl file consists of two columns:
        <seq_id> <rc flag>	
    
    The i-th row of the idTbl file has the seq_d and rc flag information for the read with read index (i-1).


    asmbl_id.tcovStats contains a coverage tiling information in the format:
    1 <ungapped consensus position> 
    2 <gapped consensus position> 
    3 <consensus base> 
    4 <depth of coverage> 
    5 <consensus quality value (computed using Churchill-Waterman algorithm)> 
    6 <consensus quality class>
    
    prefix.tcovStats contains the same information as asmbl_id.tcovStats for each contig of the prefix.contig file (if it has more than one contig), and therefore we set the first column of prefix.tcovStats to be <asmbl_id>

    asmbl_id.vertical contains a vertical representation of the assembly where a given column is used for an entire sequence, and a two row blank is used to differentiate sequences. Two blank lines in a row differentiate contigs. For example:
    A
    C
    C
    GG
    TA
     T
     G
    CC
    AA
    
    
    This shows a very small contig consisting of three sequences. The first has bases ACCGT, the second has GATGCA and the third is CA. This is the same format that the realigner program requires for input.
    This mode can be used to create inputs for the realigner program.

    Back to top

    Examples

    In the examples that follow gbr is a prefix of gbr.contig and gbr.qual files of the Brucella suis genome (GBR database), which contain assembly information for multiple contigs. 10683.contig is a contig from DMG with an associated 10683.qual file that lists just a single contig.
    1. To print a fragment of the tiling data, five positions to the left and five positions to the right from the ungapped assembly position 415 in the assembly of the 10683.contig file execute:
          getCoverage 10683 -x 415
      
      The output is:
      asmbl_id: 10683
      Ungapped  Consensus  0123   0  1  2  3
           410          T  TT.. (35 20  .  .)
           411          T  TT.. (36 17  .  .)
           412          G  GG.. (36 17  .  .)
           413          T  TTT. (36 22 34  .)
           414          G  GGG. (36 21 30  .)
           415          T  TTT. (36 21 33  .)
           416          T  TTTT (34 26 26 25)
           417          G  GGGG (36 31 33 21)
           418          T  TTTT (36 36 36 19)
           419          G  GGGG (36 26 32 15)
           420          T  TTTT (36 33 36 15)
      
      Sequence IDs  RC  Range
      0: DMGLN85TF 0 <1 484>
      1: DMGHB30TR 1 <384 523>
      2: DMGIC39TF 1 <413 995>
      3: DMGRA39TR 1 <416 620>
      

      The first column lists ungapped assembly positions, the second column lists consensus base values, and the third column displays base call coverage data. 0123 at the top of the third column lists indices of the first, second, third, and fourth respectively read that covers the corresponding fragment of the assembly. seq_id corresponding to these indices are listed under the 'Sequence IDs RC' heading. If the sequence does not tile a consensus position, then the base call and the quality value for that sequence at that position is '.'. At the bottom, there is a summary of each read, displaying the seqname, rc flag and assembly range from the contig file. The assembly range is the 1-based ungapped coordinates of the range of the consensus that the read tiles.

      Note that we didn't specify the value of -z parameter, because it is set to 5 by default. If we wanted to display a tiling 10 bases to the left and right from the position 415 we would have to execute:
          getCoverage 10683 -x 415 -z 10
      
    2. To display information of a small range near 123 in Slice XML format.
          getCoverage 10683.contig -x 123 -z 1 --xml
      
      The output is:
      <?xml version="1.0" encoding="UTF-8"?>
      <!DOCTYPE Request SYSTEM "http://slicetools.sourceforge.net/aserver.dtd">
      <Request Type="Consensus" Host="getCoverage" User="getCoverage" Option="All">
        <ReadCollection>
          <Read Id="DMGLN85TF" Seqname="DMGLN85TF" Dir="0" Chem=""/>
        </ReadCollection>
        <SliceRange Index="121" Gindex="">
          <Slice Index="0" Gindex="1" Existing="A">
            <Nuc>A</Nuc>
            <Qualval>27</Qualval>
            <ReadID>DMGLN85TF</ReadID>
          </Slice>
          <Slice Index="1" Gindex="1" Existing="A">
            <Nuc>A</Nuc>
            <Qualval>34</Qualval>
            <ReadID>DMGLN85TF</ReadID>
          </Slice>
          <Slice Index="2" Gindex="1" Existing="C">
            <Nuc>C</Nuc>
            <Qualval>32</Qualval>
            <ReadID>DMGLN85TF</ReadID>
          </Slice>
        </SliceRange>
      </Request>
      
      By default, the output would have been the entire contig.

    3. To display the tiling data in horizontal format
          getCoverage 10683.contig -x 415 -H
      
      The output is:
      Assembly 10683: [410U, 420U] (1-based)
      
                	44444444444
                	11111111112
      Position: 	01234567890
      
      Consensus:	TTGTGTTGTGT
      
      Covering Reads:
      DMGLN85TF	TTGTGTTGTGT
      DMGHB30TR	TTGTGTTGTGT
      DMGIC39TF	...TGTTGTGT
      DMGRA39TR	......TGTGT
      
      
      This shows that read DMGIC39TF begins it tiling at ungapped 1-based position 413, and DMGRA39TR begins at 416. You can also specify -G to display positions in 1-based gapped format. 0-based positions can be displayed by specifying -0 but the command line parameters (-x, -y) are always 1-based.

    4. To print tcov, and idTbl files for just assembly 70265:
          getCoverage gbr -t -a 70265
      
      The output is placed in the files:
          70265.tcov
          70265.idTbl
      

    5. To print only ungapped contig positions to the tcov file of the contig 70265, execute:
          getCoverage gbr -t --nogaps -a 70265
      
    6. If asmbl_id 70265 had a lot of reads and the tcov file is very large, you can generate a file 70265.tcov_offset whose i-th row contains the offset of the i-th row of the 70265.tcov file. This can be used to quickly access different fragments of the tcov file. To do this execute:
          getCoverage gbr -t --nogaps --offset -a 70265
      
    7. To calculate tiling information (with contig gaps included) for all contigs in one tcov file, execute
          getCoverage gbr -t -a all
      
      Note that tcov file will be quite big for a large contig file. In the above example:
          SIZE  FILE
          -----------------
          31M   gbr.contig
          257M  gbr.tcov
          564k  gbr.idTbl
      

    Back to top

    FAQs

    Q: When should I use an index (prefix.qual.idx)?
    A: The index allows for quality values to be loaded without first parsing the entire .qual file. This can greatly speed up getCoverage when dealing with large assemblies. Typically you would use the index when you will be making several calls to getCoverage (possibly with different ranges or modes) but with the same qual file. There is no advantage to using the index if you will only be running getCoverage once with a specific qual file.
    Q: What does getCoverage.warnings mean?
    A: getCoverage will create a file called getCoverage.warnings for errors it finds in assemblies that are not fatal. The includes errors relating to the number of sequences misreported, the length of a sequence misreported, or sequences with invalid clear ranges (the quality values will be replaced with 0's). In these cases, the warning usually indicates that there is a (possibly severe) problem with the data, but getCoverage will do its best effort to display the tiling as asked for.

    In some circumstances, getCoverage will detect a fatal error with the contig and create a getCoverage.error file. This will list fatal error conditions including: not being able to open required files, missing quality values for sequences (when not run with -M), the requested asmbl_id not found in the contig file, the length of the consensus sequence misreported, the qual index has been corrupted, or other internal errors. If getCoverage still causes problems after double checking your comand line parameters or removing the invalid index, please file a ticket with the Closure Software team.

    Back to top

    Support Information

    Please visit the SliceTools SourceForge Page for more information, or send email to slicetools-help [a t] lists . sourceforge . net.

    Back to top


    Last Updated: $Date: 2005/07/29 20:13:12 $