Understanding Gaps

The Problem with Gaps

Gaps present a special problem for quality calculations, since unlike the bases read from the sequencer they are artifacts of alignment and do not share the same quality properties as bases. Nevertheless, there are clear cases where gaps need to be incorporated into the calculations in some form. These cases need to be handled appropriately and consistently.

Problems

Ultimately the problem with gaps is that in almost every case, gaps have no quality value associated with them. The only instances where gaps have quality values are in those very rare instances where they have been edited, and therefore are assigned an arbitrary high quality value (usually 99 or 40). In the ordinary cases, gaps have no quality value associated with them and are assigned 00. Therefore these gaps cannot be incorporated into the quality calculations without special handling.

This problem appears immediately in the case of consensus base calling in slices with gaps. Churchill and Waterman describe a consensus base calling algorithm, which calculates the consensus of the slice to be the base in {A,C,G,T,-} that has the highest probability based on a Bayesian combination of the probabilities of the bases in the slice. This technique fails in the presence of gaps, since there quality value is 0, so '-' will never be called as the consensus of the slice. Clearly, in the cases where '-' dominates the slice, '-' should be the consensus of the slice. At issue is what it means for '-' to dominate the slice.

The technique used by TIGR Assembler and TIGR Editor is if the number of gaps is greater than or equal to the number of non-gaps in the slice, then '-' is called as the consensus of the slice. Other programs, such as Celera Assembler or gap4 assign quality to gaps in arbitrary ways, and perform the Bayesian combination for each of {A,C,G,T,-}. Other options include if the count of gaps is strictly greater, or greater by some number, than the count of non-gaps, call the consensus '-'. A final option is to use the quality of the other bases in the slice, the neighboring slices, or globally to calculate a value for the quality of the gaps. This final method presents special challenges to libSlice, which was designed with the constraint that only a single slice at a time is necessary to perform the quality calculations.

Finally, consensus quality value calculations are problematic in the presence of gaps for the same reason, namely that gaps have no associated quality value. The consensus quality value is a measure of the confidence that the consensus is accurate. The actual value is the highest probability calculated using a Bayesian combination on the quality values in the slice, and represents the probability that the non-ambiguous consensus is the true consensus of the slice. At issue is how to calculate this probability in the presence of gaps and whether the presence of gaps should affect the consensus quality value when the consensus is not gap, as conflicting information.

Solution

libSlice solves these issues by using the quality value of the flanking bases in the gapped read to determine the quality value that should be used for the gap. The rule is the quality value of a gap is the minimum quality value of the flanking non-gap bases in that read. The intuition is that high quality gaps come from high quality reads.

Unfortunately, this information is not always available to libSlice, so it is the responsibility of client applications to find the minimum value and write that quality value directly into the slice before the quality calculations are performed. In the cases where this operation is not possible, the client applications should leave the quality value of each gap as 0, and libSlice will assign a default value (currently 20).

In addition, libSlice uses lower case letters to reflect that the consensus is ambiguous with respect to gap. For example, since 'M' is used to show ambiguity between 'A' or 'C', 'm' is used to show ambiguity between 'A', 'C' or 'M'. This also means that 'a' is an ambiguity code (between 'A' and '-') while 'A' is the unambiguous base.

$Date: 2005/07/29 02:55:17 $