Understanding Ambiguity Codes

Churchill and Waterman Ambiguity Codes

Churchill and Waterman describe an algorithm for choosing ambiguity codes from the probability of each base in a slice. The computation begins by calculating five composite probabilities, one each for A, C, G, T, and - (gap), that the corresponding base is the consensus of the slice. When the most probable base alone is not likely enough to be the consensus, additional bases are added to an ambiguity set in order from most to least probable until the combined probability of the set reaches the high quality threshold. The resulting ambiguity set is then mapped back to a single-letter ambiguity code.
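The selection step described above can be sketched in Python as follows. This is only an illustration, not the libSlice implementation: the function name ambiguity_set, the dictionary layout, and the example probabilities and threshold are assumptions, and the five composite probabilities are taken as already computed.

```python
def ambiguity_set(probs, threshold):
    """Greedily build an ambiguity set from per-base consensus probabilities.

    probs maps each of 'A', 'C', 'G', 'T', '-' to the (precomputed) probability
    that it is the consensus; threshold is the high quality threshold.
    Both the name and the signature are hypothetical.
    """
    chosen = set()
    total = 0.0
    # Visit bases from most probable to least probable.
    for base in sorted(probs, key=probs.get, reverse=True):
        chosen.add(base)
        total += probs[base]
        # Stop as soon as the combined probability reaches the threshold.
        if total >= threshold:
            break
    return chosen


# Made-up probabilities for illustration only.
example = {"A": 0.5, "C": 0.3, "G": 0.1, "T": 0.07, "-": 0.03}
print(sorted(ambiguity_set(example, 0.85)))
```

With these made-up numbers, A and C together fall short of the threshold, so G is added as well and the set corresponds to the IUPAC code 'V'.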

Problems

The biggest problem with the Churchill and Waterman procedure is that it often calculates an ambiguity set that includes bases that were not present in the slice. For example, if the slice contains {A,C}, both at low quality, the procedure may return {A,C}='M', {A,C,G}='V', {A,C,G,T}='N', or {A,C,G,T,-}='N*', depending on the exact qualities of the A and C in the slice. The result is a non-intuitive ambiguity code containing "phantom" bases that were never observed, and the particular code chosen is very sensitive to the quality values of the bases in the slice.

Another issue is that the ambiguity set defined by Churchill and Waterman can include gaps, but IUPAC does not define ambiguity codes for gaps. Churchill and Waterman call the IUPAC codes combined with a gap extended ambiguity codes, and name them by appending * to the IUPAC code, i.e. {A,-} is A* and {A,C,-} is M*. The problem is how to represent an extended ambiguity code as a single letter.

A final issue is that the Churchill and Waterman calculation is very sensitive to the quality values in the slice but not to the depth of coverage, so it tends to undercall ambiguities. As a result, in regions of moderate to high coverage an unambiguous base is sometimes called where one would expect an ambiguity code as the consensus. In these circumstances, the ambiguity code should include bases of lower probability than the Churchill and Waterman procedure would include.

Solution

The problem of representing extended ambiguity codes is solved by mapping them onto lowercase letters. In this scheme, uppercase letters always represent the standard IUPAC codes, and lowercase letters represent the IUPAC code with a gap, i.e. A*='a', M*='m'.
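The uppercase/lowercase scheme can be illustrated with a small Python sketch. The IUPAC table below is the standard one; the function name single_letter_code is hypothetical, and returning '-' for a set containing only a gap is an assumption, since that case is not covered here.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
    frozenset("ACG"): "V", frozenset("ACT"): "H",
    frozenset("AGT"): "D", frozenset("CGT"): "B",
    frozenset("ACGT"): "N",
}


def single_letter_code(bases):
    """Map an ambiguity set (possibly containing '-') to one letter.

    Uppercase means the plain IUPAC code; lowercase means the same code
    plus a gap. The gap-only case returning '-' is an assumption.
    """
    bases = frozenset(bases)
    nucleotides = bases - {"-"}
    if not nucleotides:
        return "-"
    code = IUPAC[nucleotides]
    return code.lower() if "-" in bases else code


print(single_letter_code({"A", "-"}))       # A* is written as a
print(single_letter_code({"A", "C", "-"}))  # M* is written as m
print(single_letter_code({"A", "C"}))       # plain IUPAC M
```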

The problem of phantom bases in the ambiguity set is solved by limiting the bases tested for inclusion to those that are present in the slice. For example, in the slice {A,C}, the only possible ambiguity sets are {A}='A', {C}='C', and {A,C}='M'. 'V' or 'N' will not be picked, nor will any of the lowercase extended ambiguities, since no gaps are in the slice. In essence, this procedure finds the minimal ambiguity code that represents the slice, even if that code does not have a combined probability that reaches the high quality threshold.
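As a rough Python illustration of this restriction (again a sketch, not the libSlice code), the greedy selection can take an optional set of bases observed in the slice. The function name, the present parameter, and the probabilities are all made up for the example.

```python
def ambiguity_set(probs, threshold, present=None):
    """Greedy ambiguity set, optionally limited to bases seen in the slice.

    probs: precomputed consensus probability per base (hypothetical input).
    present: bases actually observed in the slice; None means no restriction,
             which corresponds to the unmodified procedure.
    """
    candidates = {b: p for b, p in probs.items()
                  if present is None or b in present}
    chosen, total = set(), 0.0
    for base in sorted(candidates, key=candidates.get, reverse=True):
        chosen.add(base)
        total += candidates[base]
        if total >= threshold:
            break
    # If the candidates run out first, the set simply stays below threshold.
    return chosen


# Slice contains only low-quality A and C (made-up probabilities).
probs = {"A": 0.45, "C": 0.35, "G": 0.08, "T": 0.07, "-": 0.05}
print(sorted(ambiguity_set(probs, 0.9)))                      # includes phantom G and T
print(sorted(ambiguity_set(probs, 0.9, present={"A", "C"})))  # only A and C
```

In this example the unrestricted call pulls in G and T to reach the threshold, giving 'N', while the restricted call stops at {A,C}='M' even though its combined probability is below the threshold.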

The final issue is addressed in libSlice by providing several consensus calling models that can be activated for different needs. The first model, known as the "Minimal" ambiguity model, uses the Churchill and Waterman algorithm with only the modification for phantom bases. The second model, called the "Annotation" ambiguity model, extends the first by transforming all high quality disagreements into ambiguity codes. It can be useful for finding SNPs, but tends to call more ambiguity codes than is otherwise useful. The third model is the "Conic" consensus caller, which addresses the shortcomings of both of the other models and is sensitive to both the quality values of the bases in the slice and the depth of coverage. See the Conic Ambiguity Model page for more information.

$Date: 2005/07/29 02:55:17 $