# «A Rate-independent Technique for Analysis of Nucleic Acid Sequences: Evolutionary Parsimony’ James A. Lake Molecular Biology Institute and ...»

A Rate-independent Technique for Analysis of Nucleic Acid Sequences: Evolutionary Parsimony¹

James A. Lake

Molecular Biology Institute and Department of Biology, University of California, Los Angeles

The method of evolutionary parsimony, or operator invariants, is a technique of nucleic acid sequence analysis related to parsimony analysis and explicitly designed for determining evolutionary relationships among four distantly related taxa.

The method is independent of substitution rates because it is derived from consideration of the group properties of substitution operators rather than from an analysis of the probabilities of substitution in branches of a tree. In both parsimony and evolutionary parsimony, three patterns of nucleotide substitution are associated one-to-one with the three topologically linked trees for four taxa. In evolutionary parsimony, the three quantities are operator invariants. These invariants are the remnants of substitutions that have occurred in the interior branch of the tree and are analogous to the substitutions assigned to the central branch by parsimony.

The two invariants associated with the incorrect trees must equal zero (statistically), whereas only the correct tree can have a nonzero invariant. The χ²-test is used to ascertain the nonzero invariant and the statistically favored tree. Examples, obtained using data calculated with evolutionary rates and branchings designed to camouflage the true tree, show that the method accurately predicts the tree, even when substitution rates differ greatly in neighboring peripheral branches (conditions under which parsimony will consistently fail). As the number of substitutions in peripheral branches becomes fewer, the parsimony and the evolutionary-parsimony solutions converge. The method is robust and easy to use.

**Introduction**

Parsimony analysis is one of the most widely used and generally accepted methods of phylogenetic analysis (Fitch 1977). It is characterized by both intellectual and operational simplicity. Yet, under conditions of unequal rates of substitution, parsimony can select an incorrect tree. Parsimony, as a successful method of phylogenetic determination, represents a baseline against which other methods can be measured.

The best-understood instance in which parsimony can incorrectly predict an unrooted tree occurs when sequences in neighboring peripheral branches of a tree evolve at greatly different rates. Felsenstein (1978), investigating a two-state model, showed that when highly different rates occur among four sequences the most parsimonious unrooted tree places the two most highly substituted peripheral branches on one side of the tree and the two least substituted peripheral branches together on the other side. This tree will be chosen no matter what the topology of the true tree.

In this paper I propose a method of phylogenetic analysis, related to parsimony analysis and called evolutionary parsimony or the method of operator invariants, that can predict the correct tree even when rates of nucleotide substitution differ by an order of magnitude in adjacent branches of the unrooted tree. The method is robust, and it is easy to calculate. In it, three quantities named “operator invariants” are calculated from four aligned nucleic acid sequences. The invariants are remnants of substitutions that have occurred in the interior branch of the tree and are analogous to the substitutions assigned by parsimony to the interior branch. Both the operator invariants and the parsimony terms are derived by analysis of patterns present in the aligned sequences. These three operator invariants are then used to predict the statistically significant dendrogram. The evolutionary-parsimony method is investigated by using data calculated from a tree of known topology and shown to accurately predict the initial tree under a variety of conditions, particularly within the zone in which the Felsenstein conditions prevail.

1. Key words: parsimony, phylogeny.

Address for correspondence and reprints: Dr. James A. Lake, Molecular Biology Institute, University of California, Los Angeles, California 90024.

Mol. Biol. Evol. 4(2):167-191. 1987.

© 1987 by The University of Chicago. All rights reserved. 0737-4038/87/0402-4207$02.00

**Theory**

Determining the unrooted evolutionary tree that best reconstructs the evolution of four taxa requires the discrimination of a single tree from a set of three alternative tree topologies. Hence, the problem for four taxa serves as the simplest case model for developing a method to discriminate among topologically distinct dendrograms.

Parsimony and evolutionary parsimony have related but differing criteria. Parsimony selects the tree that requires the minimum number of substitutions. In contrast, evolutionary parsimony selects the tree that requires the minimum number of consistent substitutions (“consistent” is used to imply consistency with evolution in the peripheral branches of the tree). In the limit that the number of substitutions in the peripheral branches of the tree becomes small relative to those in the central branch, all substitutions become consistent ones and the parsimony solution converges to the evolutionary-parsimony solution.

Two simple examples serve to illustrate the differences between substitutions and consistent substitutions. Consider the trees in figure 1. The initial tree in 1a refers to the tree used to calculate the sequences, and the most parsimonious tree is the tree inferred from analysis of the calculated sequences. In these examples, the probability of substitution is equal for all bases. Thus, for an RNA sequence (nucleotides C, U, A, or G), an A would be replaced with equal likelihood by U, C, or G. When there is a high probability of nucleotide substitution in the central branch of the initial tree and low probabilities in the other branches, one finds the pattern xxyy at most positions. This is the informative pattern for parsimony and identifies the tree that positions taxa 1 and 2 together and 3 and 4 together as being the most parsimonious. The absence of other patterns, except for xxxx, indicates that most substitutions are consistent ones. In this example, parsimony correctly predicts the initial tree.

FIG. 1.-Examples illustrating when parsimony correctly selects a tree and when it fails. Branch lengths (of either 0 or ~0.8) represent the relative probabilities of a nucleotide difference at any one position. The patterns (Cavender 1981) observed in the aligned sequences and the number of their occurrences are shown adjacent to the sequences. In 1a parsimony correctly predicts the true tree. In 1b the tree predicted by parsimony places taxa 1 and 3 and taxa 2 and 4 together in a different topological arrangement from that found in the true tree.

In the second example, figure 1b, the probability of nucleotide substitutions is very large in the peripheral branches leading to taxa 1 and 3 and small in the branches leading to taxa 2 and 4 and in the central branch. Typical sequences are shown in the panel below the true tree, but the expected pattern xxyy that is diagnostic for the true tree is not present. Contrary to one's expectations, the informative pattern for parsimony that is present is xyxy. (For this example calculations show that, in the limit of infinite substitution in branches 1 and 3, the xyxy pattern should occur at fully 3/16 of the positions.) Hence, the most parsimonious (or minimum-substitution) tree in figure 1b is not the initial tree but is the tree that connects taxa 1 and 3 and connects taxa 2 and 4. In this example, parsimony has picked an incorrect tree because substitutions inserted in peripheral branches of the tree have mimicked the pattern normally produced by substitutions in the central branch of an alternative tree topology. Those substitutions that mimic an incorrect pattern are described as inconsistent substitutions.
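The 3/16 figure quoted for the xyxy pattern can be checked by direct enumeration. The sketch below is only an idealization of figure 1b: it assumes that branches 1 and 3 are fully randomized (each of those two taxa carries an independent, uniformly drawn nucleotide), that no substitutions at all occur in the other branches (so taxa 2 and 4 keep the ancestral nucleotide), and that all four nucleotides are equally likely. The function names are illustrative and do not come from the paper.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

BASES = "ACGU"

def pattern(column):
    """Relabel a column by order of first appearance, e.g. 'GAGA' -> 'xyxy'."""
    labels = {}
    for base in column:
        if base not in labels:
            labels[base] = "xyzw"[len(labels)]
    return "".join(labels[base] for base in column)

def limiting_pattern_frequencies():
    """Exact pattern frequencies in the assumed limit of saturated branches 1 and 3."""
    counts = Counter()
    # Taxa 2 and 4 keep the ancestral base; taxa 1 and 3 are independent uniform draws.
    for ancestor, taxon1, taxon3 in product(BASES, repeat=3):
        counts[pattern(taxon1 + ancestor + taxon3 + ancestor)] += 1
    total = sum(counts.values())
    return {name: Fraction(count, total) for name, count in counts.items()}

frequencies = limiting_pattern_frequencies()
print(frequencies["xyxy"])         # share of positions supporting the (1,3)(2,4) tree
print(frequencies.get("xxyy", 0))  # share of positions supporting the true (1,2)(3,4) tree
```

Under these assumptions the enumeration gives 3/16 for xyxy and 0 for xxyy, which is why parsimony is drawn to the tree joining taxa 1 and 3 in this limit.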

The presence of a second type of pattern (xyxz) indicates, however, that xyxy might represent inconsistent substitutions. In the following sections, explicit definitions of both consistent and inconsistent substitutions are detailed and a parsimony-like procedure for determining trees using consistent substitutions is presented.

**A Vector Representation**

Descriptions of operator invariants and of both consistent and inconsistent substitutions are facilitated by using a vector representation of sequences. In this representation a set of four aligned sequences, each of length n, is represented as the vector sum of n vectors. Thus, in figure 2a, each of the 256 (or 4⁴) possible combinations of nucleotides represents a direction in a 256-dimensional sequence space. For example, the vector CGGC, present at the third position along the sequence, is one of 20 subvectors that make up the sequence vector, S.

This 256-dimensional space can be considerably simplified if one includes information about the (molecular-biological) details of the substitution process. Because DNA copying and repair mechanisms distinguish most readily between the larger purines and the smaller pyrimidines, exchanges that substitute one purine for another or one pyrimidine for another (transitions) occur much more frequently than those that interchange a pyrimidine and a purine (transversions). Wilson and co-workers (Brown et al. 1982), for example, have shown that, for mitochondrial DNAs, transitions occur an order of magnitude more frequently than transversions. This difference is applied to the definition of basis vectors in the following paragraph.

Distinguishing between transitions and transversions allows one to reduce the number of basis vectors from 256 to 36. This simpler representation, which replaces each of the nucleotide letter symbols with the numbers one through four, is shown in figure 2b. Since the representation of the nucleotide in position one in a vector is arbitrary, a “1” is assigned to represent it and all others of the same type. Any nucleotide related to the nucleotide in position one by a transition is assigned a “2” to represent it. The first nucleotide (if any) that is related to the nucleotide in position one by a transversion (and all others of the same type) is represented by a “3.” Finally, any nucleotide related by a transition to the type represented by a “3” is represented by a “4.” With this notation, any combination of four nucleotides can be represented by one of 36 types.
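The numbering rule just described can be written out as a short routine. The following is a minimal sketch, not code from the paper: the function name `encode_column` and the choice to treat T as U are assumptions made for illustration. It applies the rule to a single aligned position (one nucleotide from each of the four taxa) and then enumerates all 256 combinations to confirm that they collapse to 36 types.

```python
from itertools import product

# Transition partners: purine <-> purine (A, G) and pyrimidine <-> pyrimidine (C, U).
TRANSITION_PARTNER = {"A": "G", "G": "A", "C": "U", "U": "C"}

def encode_column(column):
    """Encode four aligned nucleotides in the 1-4 notation (e.g. 'CGGC' -> '1331')."""
    column = column.upper().replace("T", "U")  # assumption: DNA input is mapped to RNA letters
    first = column[0]
    # The nucleotide in position one is "1"; its transition partner is "2".
    codes = {first: "1", TRANSITION_PARTNER[first]: "2"}
    for base in column[1:]:
        if base not in codes:
            # First nucleotide related to position one by a transversion is "3",
            # and the transition partner of that nucleotide is "4".
            codes[base] = "3"
            codes[TRANSITION_PARTNER[base]] = "4"
    return "".join(codes[base] for base in column)

all_types = {encode_column("".join(bases)) for bases in product("ACGU", repeat=4)}

print(encode_column("CGGC"))  # the example position from figure 2a
print(encode_column("UGGG"))
print(len(all_types))         # number of distinct types among all 256 columns
```

The two example positions encode to 1331 and 1333, and the enumeration yields 36 distinct types, in agreement with the count stated in the text.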

To simplify this further, a shorthand, one-letter notation is introduced (table 1). In the example in figure 2a, position CGGC becomes 1331 in 2b and is abbreviated as vector component G; UGGG becomes 1333 and is abbreviated as component A. Thus the set of four aligned sequences can be represented by the single line of components shown in figure 2c. Similarly, a unit vector pointing in the G direction will be represented as $\hat{G}$ in the one-letter code.

This notation allows one to describe four aligned sequences either as spectral components of the aligned sequences or as a sequence vector. In the example in figure 2 the vector component G (1331) occurs four times, and the value of the G spectral component is listed as G = 4. Similarly, the sequence vector S, corresponding to spectral components a, A, b, B, etc. and unit vectors $\hat{a}$, $\hat{A}$, etc., is written as

$$S = a\hat{a} + A\hat{A} + b\hat{b} + B\hat{B} + \cdots \qquad (1)$$

**Operator-Invariant Analysis**

The spectral components from the previous example are used to illustrate parsimony analysis and evolutionary-parsimony analysis (fig. 3). In this example and throughout this paper, the parsimony analysis will consider only transversion substitutions in the central branch of the tree. Thus, each of the three spectral components E, F, and G is most parsimoniously associated with one of three possible evolutionary trees called the E, the F, and the G trees, respectively. These trees are shown in figure 4. The tree associated with the largest component is most parsimonious, i.e., requires the minimal number of transversion substitutions. Parsimony analysis of the spectral components in figure 3 identifies the G tree as being most parsimonious.
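Counting spectral components and picking the most parsimonious tree can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's procedure. The text confirms only that pattern 1331 is component G; pairing 1133 with E and 1313 with F is assumed here by analogy, since table 1 is not reproduced in this section. The input columns are hypothetical and are not the data of figure 2.

```python
from collections import Counter

# Assumed pairing of transversion patterns with components. Only 1331 -> G is
# stated in the text; 1133 -> E and 1313 -> F are assumptions for illustration.
PARSIMONY_PATTERNS = {"E": "1133", "F": "1313", "G": "1331"}

def spectral_components(encoded_columns):
    """Count how often each 1-4 pattern occurs along the aligned sequences."""
    return Counter(encoded_columns)

def most_parsimonious_tree(spectrum):
    """Return the tree whose parsimony component is largest (first one wins a tie)."""
    return max(PARSIMONY_PATTERNS, key=lambda tree: spectrum[PARSIMONY_PATTERNS[tree]])

# Hypothetical alignment of 20 positions, already written in the 1-4 notation.
columns = ["1331"] * 4 + ["1133"] * 2 + ["1313"] + ["1333"] * 3 + ["1111"] * 10
spectrum = spectral_components(columns)

print(spectrum["1331"])                  # value of the G component for this toy input
print(most_parsimonious_tree(spectrum))  # tree with the largest parsimony component
```

Because `Counter` returns zero for patterns that never occur, a component that is absent from the alignment simply counts as 0 in the comparison.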

The method of evolutionary parsimony is similar to parsimony but uses additional spectral components to determine consistent substitutions. As shown in figure 3 the operator invariants are linear combinations of four spectral components. As with parsimony, each invariant (X, Y, or Z) is associated with a tree (the E, F, and G trees, respectively). The evolutionary interpretation of the operator invariants is that they are the remnants of transversion substitutions made in the central branch of the tree. Only the historically correct tree has contributed consistent substitutions to the sequences, and only it can have a nonzero invariant. The two incorrect trees cannot have remnants and thus will be associated with (statistically) zero invariants. In the example in figure 3, only the X invariant (the E tree) is found to be significantly greater than zero when the invariants are analyzed by the χ²-test (see Statistical Tests and Tree Selection below). The observation that the Z invariant is approximately equal to zero, even though the G tree is the most parsimonious, indicates that many of the substitutions supporting the G tree are inconsistent substitutions.

FIG. 3.-The operator spectral components derived in fig. 2 analyzed using both the parsimony method and the method of evolutionary parsimony. In this example, evolutionary parsimony selects the correct E tree even though tree G is most parsimonious.

Each operator invariant (X, Y, or Z) has three types of spectral components measuring different aspects of the evolutionary process, namely a parsimony term (E, F, or G), two peripheral-branch terms (H and J, L and N, or Q and S), and a compensatory term (u, v, or w). As a guide to understanding the invariants, examples of the functioning of their components are given below.

Under conditions of low substitution rates in peripheral tree branches, the peripheral-branch terms and the compensatory term will be small and only the parsimony term will be large. This is the reason that, in this limit, parsimony and evolutionary parsimony predict the same tree.
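The limiting behavior described here can be illustrated numerically. This section does not spell out the coefficients of the linear combination (they are given in figure 3, which is not reproduced here), so the sketch below rests on an assumption: that the parsimony and compensatory terms enter with a plus sign and the two peripheral-branch terms with a minus sign. The function name and all counts are hypothetical.

```python
def operator_invariant(parsimony, peripheral_a, peripheral_b, compensatory):
    """Combine the four spectral components of one invariant.

    Assumed form (signs are not stated in this section):
    invariant = parsimony - peripheral_a - peripheral_b + compensatory
    """
    return parsimony - peripheral_a - peripheral_b + compensatory

# Hypothetical counts: few peripheral substitutions, so only the parsimony term is large.
low_rate = operator_invariant(parsimony=12, peripheral_a=0, peripheral_b=0, compensatory=0)

# Hypothetical counts: frequent peripheral transversions inflate the parsimony term,
# but the peripheral-branch terms grow with it and cancel the excess.
high_rate = operator_invariant(parsimony=9, peripheral_a=6, peripheral_b=5, compensatory=2)

print(low_rate)   # reduces to the parsimony term
print(high_rate)  # large parsimony term, yet the invariant vanishes
```

With small peripheral and compensatory terms the invariant is just the parsimony term, so both methods rank the trees identically; with large peripheral terms a sizable parsimony term can still correspond to an invariant of zero.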

When transversion substitutions in peripheral branches of the tree are frequent, this can artifactually increase the parsimony term associated with the incorrect trees.