| A number of different [[Markov chain|Markov]] '''models of DNA sequence evolution''' have been proposed. These [[substitution model]]s differ in terms of the parameters used to describe the rates at which one [[nucleotide]] replaces another during evolution. These models are frequently used in [[Molecular phylogenetics|molecular phylogenetic analyses]]. In particular, they are used during the calculation of likelihood of a tree (in [[Bayesian inference|Bayesian]] and [[maximum likelihood]] approaches to tree estimation) and they are used to estimate the evolutionary distance between sequences from the observed differences between the sequences.
| | |
| ==Introduction==
| |
| | |
| These models are phenomenological descriptions of the evolution of DNA as a string of four discrete states. These Markov models do not explicitly depict the mechanism of mutation nor the action of natural selection. Rather they describe the relative rates of different changes. For example, mutational biases and [[purifying selection]] favoring conservative changes are probably both responsible for the relatively high rate of [[Transition (genetics)|transitions]] compared to [[transversions]] in evolving sequences. However, the Kimura (K80) model described below merely attempts to capture the effect of both forces in a parameter that reflects the relative rate of transitions to transversions.
| |
| | |
| Evolutionary analyses of sequences are conducted on a wide variety of time scales. Thus, it is convenient to express these models in terms of the instantaneous rates of change between different states (the ''Q'' matrices below). If we are given a starting (ancestral) state at one position, the model's ''Q'' matrix and a branch length expressing the expected number of changes to have occurred since the ancestor, then we can derive the probability of the descendant sequence having each of the four states. The mathematical details of this transformation from rate-matrix to probability matrix are described in [[Substitution model#The mathematics of substitution models|the mathematics of substitution models section]] of the [[substitution model]] page. By expressing models in terms of the instantaneous rates of change we can avoid estimating a large number of parameters for each branch on a phylogenetic tree (or each comparison if the analysis involves many pairwise sequence comparisons).
| |
| | |
| The models described on this page describe the evolution of a single site within a set of sequences. They are often used for analyzing the evolution of an entire [[locus (genetics)|locus]] by making the simplifying assumption that different sites evolve [[Iid|independently and are identically distributed]]. This assumption may be justifiable if the sites can be assumed to be evolving [[Neutral theory of molecular evolution|neutrally]]. If the primary effect of natural selection on the evolution of the sequences is to constrain some sites, then models of among-site rate-heterogeneity can be used. This approach allows one to estimate only one matrix of relative rates of substitution, and another set of parameters describing the variance in the total rate of substitution across sites.
| |
| | |
| ==DNA evolution as a continuous-time Markov chain==
| |
| | |
| ===Continuous-time Markov chains===
| |
| ''Continuous-time'' [[Markov chains]] have the usual transition matrices
| |
| which are, in addition, parameterized by time, <math> t\ </math>. Specifically, if <math> E_1,\ldots,E_4\ </math> are the states, then the transition matrix
| |
| | |
| : <math> P(t) = \big(P_{ij}(t)\big)</math> where each individual entry, <math> P_{ij}(t)\ </math> refers to the probability that state <math> E_i\ </math> will change to state <math> E_j\ </math> in time <math> t\ </math>.
| |
| | |
| '''Example:''' We would like to model the substitution process in DNA sequences (''i.e.'' [[Jukes–Cantor]], Kimura, ''etc.'') in a continuous-time fashion. The corresponding transition matrices will look like:
| |
| | |
| : <math>
| |
| P(t) = \begin{pmatrix} p_{AA}(t) & p_{GA}(t) & p_{CA}(t) & p_{TA}(t) \\
| |
| p_{AG}(t) & p_{GG}(t) & p_{CG}(t) & p_{TG}(t) \\
| |
| p_{AC}(t) & p_{GC}(t) & p_{CC}(t) & p_{TC}(t) \\
| |
| p_{AT}(t) & p_{GT}(t) & p_{CT}(t) & p_{TT}(t)
| |
| \end{pmatrix} </math>
| |
| | |
| where the top-left and bottom-right 2 × 2 blocks correspond to ''transition probabilities'' and the top-right and bottom-left 2 × 2 blocks correspond to ''transversion probabilities''.
| |
| | |
| '''Assumption:''' If at some time <math> t_0\ </math>, the Markov chain is in state <math> E_i\ </math>, then the probability that at time <math> t_0+t\ </math>, it will be in state <math>E_j\ </math> depends only upon <math>i\ </math>, <math>j\ </math> and <math>t\ </math>. This then allows us to write that probability as <math>p_{ij}(t)\ </math>.
| |
| | |
| '''Theorem:''' Continuous-time transition matrices satisfy:
| |
| | |
| ::<math>P(t+\tau) = P(t)P(\tau)\ </math>
| |
| | |
| '''Note:''' There is here a possible confusion between two meanings of the word ''transition''. (i) In the context of ''Markov chains'', transition is the general term that refers to the change between two states. (ii) In the context of ''nucleotide changes in DNA sequences'', transition is a specific term that refers to the exchange between either the two purines (A ↔ G) or the two pyrimidines (C ↔ T) (for additional details, see the article about [[Transition (genetics)|transitions in genetics]]). By contrast, an exchange between one purine and one pyrimidine is called a [[transversion]].
| |
| | |
| ===Deriving the dynamics of substitution===
| |
| Consider a DNA sequence of fixed length ''m'' evolving in time by base replacement. Assume that the processes followed by the ''m'' sites are Markovian, independent, identically distributed and constant in time. For a fixed site, let
| |
| | |
| :<math> \mathbf{P}(t) = (p_A(t),\ p_G(t),\ p_C(t),\ p_T(t))^T</math>
| |
| | |
| be the column vector of probabilities of states <math>A, \ </math> <math>\ G, \ </math> <math>\ C, \ </math> and <math>\ T \ </math> at time <math> t \ </math>. Let
| |
| | |
| :<math> \mathcal{E} = \{A,\ G, \ C, \ T\}</math>
| |
| | |
| be the state-space. For two distinct
| |
| | |
| :<math> x, y \in \mathcal{E}</math>, let <math> \mu_{xy}\ </math>
| |
| | |
| be the transition rate from state <math>x\ </math> to state <math>y\ </math>. Similarly, for any <math> x\ </math>, let:
| |
| :<math>\mu_x = \sum_{y\neq x}\mu_{xy}</math>
| |
| | |
| The changes in the probability distribution <math> p_A(t)\ </math> for small increments of time <math>\Delta t\ </math> are given by:
| |
| :<math>p_A(t+\Delta t) = p_A(t) - p_A(t)\mu_A\Delta t + \sum_{x\neq A}p_x(t)\mu_{xA}\Delta t </math>
| |
| | |
| In other words (in frequentist language), the frequency of <math> A\ </math>'s at time <math> t + \Delta t\ </math> is equal to the frequency at time <math> t\ </math> minus the frequency of the ''lost'' <math> A\ </math>'s plus the frequency of the ''newly created'' <math> A\ </math>'s.
| |
| | |
| Similarly for the probabilities <math>p_G(t), \ p_C(t), \ \mathrm{and} \ p_T(t)</math>. We can write these compactly as:
| |
| :<math>\mathbf{P}(t+\Delta t) = \mathbf{P}(t) + Q\mathbf{P}(t)\Delta t </math>
| |
| | |
| where,
| |
| | |
| :<math> Q = \begin{pmatrix} -\mu_A & \mu_{GA} & \mu_{CA} & \mu_{TA} \\
| |
| \mu_{AG} & -\mu_G & \mu_{CG} & \mu_{TG} \\
| |
| \mu_{AC} & \mu_{GC} & -\mu_C & \mu_{TC} \\
| |
| \mu_{AT} & \mu_{GT} & \mu_{CT} & -\mu_T \end{pmatrix}</math>
| |
| or, alternately:
| |
| :<math> \mathbf{P}'(t) = Q\mathbf{P}(t) </math>
| |
| | |
| where, <math>Q\ </math> is the ''rate'' matrix. Note that by definition, the columns of <math>Q\ </math> sum to zero.
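The difference equation above can be iterated directly. The following is a minimal sketch, assuming hypothetical off-diagonal rates and the state order A, G, C, T; the function names <code>make_rate_matrix</code> and <code>euler_step</code> are illustrative and do not come from any particular package. It fills in the diagonal so that each column of ''Q'' sums to zero and then applies many small steps.

```python
# State order: A, G, C, T. Entry q[i][j] is the rate from state j to state i
# (column-vector convention), so every column of q sums to zero.

def make_rate_matrix(rates):
    """Copy the off-diagonal rates and set each diagonal entry so that
    its column sums to zero."""
    n = len(rates)
    q = [row[:] for row in rates]
    for j in range(n):
        q[j][j] = -sum(rates[i][j] for i in range(n) if i != j)
    return q


def euler_step(q, p, dt):
    """One step of P(t + dt) = P(t) + Q P(t) dt."""
    n = len(p)
    return [p[i] + dt * sum(q[i][j] * p[j] for j in range(n)) for i in range(n)]


# Hypothetical off-diagonal rates (the zeros on the diagonal are placeholders).
rates = [
    [0.0, 0.4, 0.1, 0.1],
    [0.3, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.5],
    [0.1, 0.1, 0.2, 0.0],
]
Q = make_rate_matrix(rates)

p = [1.0, 0.0, 0.0, 0.0]  # the site starts in state A with certainty
for _ in range(20000):    # 20000 steps of size 0.001, i.e. t = 20
    p = euler_step(Q, p, 0.001)

print(p, sum(p))
```

Because the columns of ''Q'' sum to zero, each step leaves the total probability unchanged (up to rounding), which is a quick sanity check on any hand-built rate matrix.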
| |
| | |
| ===Ergodicity===
| |
| | |
| If all the transition rates, <math> \mu_{xy}\ </math> are positive, ''i.e.'' if all states <math> x, y \in \mathcal{E}\ </math> ''communicate'', then the Markov chain has a ''stationary'' distribution <math> \mathbf{\Pi} = \{\pi_x, \ x \in \mathcal{E} \} </math> where each <math> \pi_x \ </math> is the proportion of time spent in state <math> x\ </math> after the Markov chain has run for infinite time, and this probability does not depend upon the initial state of the process. Such a Markov chain is called '''''ergodic'''''. In DNA evolution, under the assumption of a common process for each site, the stationary frequencies, <math> \pi_A, \pi_G, \pi_C, \pi_T \ </math> correspond to equilibrium base compositions.
| |
|
| |
| '''Definition''' A Markov process is ''stationary'' if its current distribution is the stationary distribution, ''i.e.'' <math> \mathbf{P}(t) = \Pi\ </math>
| |
| | |
| Thus, by using the differential equation above:
| |
| :<math> \frac{d\Pi}{dt} = Q\Pi = 0 </math>
| |
| | |
| ===Time reversibility===
| |
| | |
| '''Definition''': A stationary Markov process is ''time reversible'' if (in the steady state) the amount of change from state <math> x\ </math> to <math> y\ </math> is equal to the amount of change from <math> y\ </math> to <math> x\ </math>, (although the two states may occur with different frequencies). This means that:
| |
| :<math> \pi_x\mu_{xy} = \pi_y\mu_{yx} \ </math>
| |
| | |
| Not all stationary processes are reversible; however, almost all DNA evolution models assume time reversibility, which is considered to be a reasonable assumption.
| |
| | |
| Under the time reversibility assumption, let <math> s_{xy} = \mu_{xy}/\pi_y\ </math>, then it is easy to see that:
| |
| :<math> s_{xy} = s_{yx} \ </math>
| |
| | |
| '''Definition''' The symmetric term <math> s_{xy}\ </math> is called the ''exchangeability'' between states <math>x\ </math> and <math> y\ </math>. In other words, <math> s_{xy}\ </math> is the fraction of the frequency of state <math> x\ </math> that is the result of transitions from state <math> y\ </math> to state <math> x\ </math>.
| |
| | |
| '''Corollary''' The 12 off-diagonal entries of the rate matrix, <math> Q\ </math> (note the off-diagonal entries determine the diagonal entries, since the columns of <math> Q\ </math> sum to zero) can be completely determined by 9 numbers; these are: 6 exchangeability terms and 3 stationary frequencies <math>\pi_x\ </math>, (since the stationary frequencies sum to 1).
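The corollary can be illustrated numerically. The sketch below assumes hypothetical values for the 6 exchangeabilities and 3 free frequencies, builds the 12 rates from <math>\mu_{xy} = s_{xy}\pi_y</math>, and leaves the detailed-balance condition to be checked on the result; the helper name <code>reversible_rates</code> is illustrative only.

```python
from itertools import combinations

STATES = "AGCT"


def reversible_rates(exch, freqs):
    """Return mu[x][y] = s_xy * pi_y for every ordered pair x != y."""
    mu = {x: {} for x in STATES}
    for x, y in combinations(STATES, 2):
        s = exch[x + y]           # one symmetric exchangeability per pair
        mu[x][y] = s * freqs[y]   # rate from x to y
        mu[y][x] = s * freqs[x]   # rate from y to x
    return mu


# Hypothetical parameter values: 6 exchangeabilities, 3 free frequencies.
exch = {"AG": 4.0, "AC": 1.0, "AT": 0.8, "GC": 1.2, "GT": 1.0, "CT": 3.5}
freqs = {"A": 0.3, "G": 0.2, "C": 0.2}
freqs["T"] = 1.0 - sum(freqs.values())  # the fourth frequency is determined

mu = reversible_rates(exch, freqs)

# Detailed balance: pi_x * mu_xy equals pi_y * mu_yx for every pair.
for x, y in combinations(STATES, 2):
    print(x, y, freqs[x] * mu[x][y], freqs[y] * mu[y][x])
```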
| |
| | |
| ===Scaling of branch lengths===
| |
| By comparing extant sequences, one can determine the amount of sequence divergence. This raw measurement of divergence provides information about the number of changes that have occurred along the path separating the sequences. The simple count of differences (the [[Hamming distance]]) between sequences will often underestimate the number of substitutions because of multiple hits (see [[homoplasy]]). Trying to estimate the exact number of changes that have occurred is difficult, and usually not necessary. Instead, branch lengths (and path lengths) in phylogenetic analyses are usually expressed in the expected number of changes per site. The path length is the product of the duration of the path in time and the mean rate of substitutions. While their product can be estimated, the rate and time are not identifiable from sequence divergence.
| |
| | |
| The descriptions of rate matrices on this page accurately reflect the relative magnitude of different substitutions, but these rate matrices are '''not''' scaled such that a branch length of 1 yields one expected change. This scaling can be accomplished by multiplying every element of the matrix by the same factor, or simply by scaling the branch lengths. If we use β to denote the scaling factor, and ν to denote the branch length measured in the expected number of substitutions per site, then βν is used in the transition probability formulae below in place of μ''t''. Note that ν is a parameter to be estimated from data, and is referred to as the branch length, while β is simply a number that can be calculated from the rate matrix (it is not a separate free parameter).
| |
| | |
| The value of β can be found by forcing the expected rate of flux of states to 1. The diagonal entries of the rate-matrix (the ''Q'' matrix), written μ''<sub>ii</sub>'' below, represent -1 times the rate of leaving each state. For time-reversible models, we know the equilibrium state frequencies (these are simply the π''<sub>i</sub>'' parameter value for state ''i''). Thus we can find the expected rate of change by calculating the sum of flux out of each state weighted by the proportion of sites that are expected to be in that class. Setting β to be the reciprocal of this sum will guarantee that the scaled process has an expected flux of 1:
| |
| :<math>\beta = 1/\left(-\sum_i \pi_i\mu_{ii}\right)</math>
| |
| For example, in the Jukes-Cantor model, the scaling factor would be ''4/(3μ)'' because the rate of leaving each state is ''3μ/4''.
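As a small worked check of the scaling rule, the sketch below computes β from the diagonal of a rate matrix and the equilibrium frequencies, using a Jukes-Cantor matrix with a hypothetical value of μ. The function names are illustrative assumptions.

```python
def scaling_factor(q, freqs):
    """beta = 1 / (-sum_i pi_i * q_ii), the reciprocal of the expected flux."""
    flux = -sum(pi * q[i][i] for i, pi in enumerate(freqs))
    return 1.0 / flux


def jc69_rate_matrix(mu):
    """Jukes-Cantor rate matrix: mu/4 off the diagonal, -3*mu/4 on it."""
    off = mu / 4.0
    return [[-3.0 * off if i == j else off for j in range(4)] for i in range(4)]


mu = 2.5                      # hypothetical overall substitution rate
freqs = [0.25, 0.25, 0.25, 0.25]
Q = jc69_rate_matrix(mu)
beta = scaling_factor(Q, freqs)

# Multiplying every entry by beta gives a matrix whose expected flux is 1.
Q_scaled = [[beta * value for value in row] for row in Q]
print(beta, 4.0 / (3.0 * mu), scaling_factor(Q_scaled, freqs))
```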
| |
| | |
| == Most common models of DNA evolution ==
| |
| | |
| === JC69 model ([[Thomas H. Jukes|Jukes]] and [[Charles Cantor|Cantor]], 1969)<ref>{{cite book |author=Jukes TH and Cantor CR |year=1969 |title=Evolution of Protein Molecules |pages=21–132 |publisher=Academic Press |location=New York}}</ref> ===
| |
| JC69 is the simplest [[substitution model]]. It assumes equal base frequencies <math>\left(\pi_A = \pi_G = \pi_C = \pi_T = {1\over4}\right)</math> and equal [[mutation rates]]. The only parameter of this model is therefore <math>\mu</math>, the overall substitution rate. As previously mentioned, this variable becomes a constant when we normalize the mean rate to 1.
| |
| | |
| : <math>Q = \begin{pmatrix} {*} & {\mu\over 4} & {\mu\over 4} & {\mu\over 4} \\ {\mu\over 4} & {*} & {\mu\over 4}& {\mu\over 4}\\ {\mu\over 4}& {\mu\over 4}& {*} & {\mu\over 4}\\ {\mu\over 4}& {\mu\over 4}& {\mu\over 4}& {*} \end{pmatrix}</math>
| |
| | |
| : <math> P= \begin{pmatrix} {{1\over4} + {3\over4}e^{-t\mu}} & {{1\over4} - {1\over4}e^{-t\mu}} & {{1\over4} - {1\over4}e^{-t\mu}} & {{1\over4} - {1\over4}e^{-t\mu}} \\\\ {{1\over4} - {1\over4}e^{-t\mu}} & {{1\over4} + {3\over4}e^{-t\mu}} & {{1\over4} - {1\over4}e^{-t\mu}} & {{1\over4} - {1\over4}e^{-t\mu}} \\\\ {{1\over4} - {1\over4}e^{-t\mu}} & {{1\over4} - {1\over4}e^{-t\mu}} & {{1\over4} + {3\over4}e^{-t\mu}} & {{1\over4} - {1\over4}e^{-t\mu}} \\\\ {{1\over4} - {1\over4}e^{-t\mu}} & {{1\over4} - {1\over4}e^{-t\mu}} & {{1\over4} - {1\over4}e^{-t\mu}} & {{1\over4} + {3\over4}e^{-t\mu}} \end{pmatrix}</math>
| |
| | |
| When branch length, <math>\nu</math>, is measured in the expected number of changes per site then:
| |
| : <math>P_{ij}(\nu) = \left\{
| |
| \begin{array}{cc}
| |
| {1\over4} + {3\over4}e^{-4\nu/3} & \mbox{ if } i = j \\
| |
| {1\over4} - {1\over4}e^{-4\nu/3} & \mbox{ if } i \neq j
| |
| \end{array}
| |
| \right.</math>
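The closed form above is easy to evaluate and to check against the semigroup property <math>P(t+\tau) = P(t)P(\tau)</math> stated earlier. The following is a minimal sketch with hypothetical branch lengths; the helper names are illustrative.

```python
import math


def jc69_prob(i, j, nu):
    """P_ij(nu) under JC69, with nu in expected substitutions per site."""
    decay = math.exp(-4.0 * nu / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay


def jc69_transition_matrix(nu):
    return [[jc69_prob(i, j, nu) for j in range(4)] for i in range(4)]


def matmul(a, b):
    """Plain 4 x 4 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


# Hypothetical branch lengths: two consecutive branches of 0.1 and 0.25.
combined = jc69_transition_matrix(0.35)
product = matmul(jc69_transition_matrix(0.1), jc69_transition_matrix(0.25))

largest_gap = max(abs(combined[i][j] - product[i][j])
                  for i in range(4) for j in range(4))
print(largest_gap)
```

The gap between the two matrices should be at the level of floating-point rounding, since evolving along two consecutive branches is equivalent to evolving along one branch of the summed length.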
| |
| | |
| It is worth noting that <math>\nu={3\over4}t\mu=({\mu\over4}+{\mu\over4}+{\mu\over4})t</math>, which is the sum of the off-diagonal entries of any column (or row) of the matrix <math>Q</math> multiplied by time. It is therefore the expected number of substitutions per site in time <math>t</math> (the branch duration) when the rate of substitution equals <math>\mu</math>.
| |
| | |
| Given the proportion <math>p</math> of sites that differ between the two sequences the Jukes-Cantor estimate of the evolutionary distance (in terms of the expected number of changes) between two sequences is given by
| |
| :<math>\hat{d}=-{3\over4} \ln({1-{4\over3}p})=\hat{\nu}</math>
| |
| The <math>p</math> in this formula is frequently referred to as the <math>p</math>-distance. It is a [[sufficient statistic]] for calculating the Jukes-Cantor distance correction, but is not sufficient for the calculation of the evolutionary distance under the more complex models that follow (also note that <math>p</math> used in subsequent formulae is not identical to the "<math>p</math>-distance").
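A minimal sketch of the Jukes-Cantor correction follows. The two aligned sequences are hypothetical, and the function names <code>p_distance</code> and <code>jc69_distance</code> are illustrative rather than taken from any library.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of aligned sites at which the two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)


def jc69_distance(p):
    """Jukes-Cantor estimate of the expected number of changes per site."""
    if p >= 0.75:
        raise ValueError("the correction is undefined for p >= 3/4")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical aligned sequences differing at 2 of 10 sites.
seq_a = "ACGTACGTAC"
seq_b = "ACGTTCGTAA"

p = p_distance(seq_a, seq_b)
d = jc69_distance(p)
print(p, d)
```

The corrected distance exceeds the raw proportion of differences, reflecting the multiple hits that a simple count misses.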
| |
| | |
| === K80 model ([[Motoo Kimura|Kimura]], 1980)<ref>{{cite journal |author=Kimura M |year=1980 |title=A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences |journal=Journal of Molecular Evolution |volume=16 |pages=111–120 |doi=10.1007/BF01731581 |pmid=7463489 |issue=2}}</ref> ===
| |
| | |
| The K80 model distinguishes between [[Transition (genetics)|transitions]] (A <-> G, i.e. from purine to purine, or C <-> T, i.e. from pyrimidine to pyrimidine) and [[transversion]]s (from purine to pyrimidine or vice versa). In Kimura's original description of the model, α and β were used to denote the rates of these types of substitutions, but it is now more common to set the rate of transversions to 1 and use κ to denote the transition/transversion rate ratio (as is done below). The K80 model assumes that all of the bases are equally frequent (''π<sub>T</sub>=π<sub>C</sub>=π<sub>A</sub>=π<sub>G</sub>=0.25'').
| |
| | |
| Rate matrix <math>Q= \begin{pmatrix} {*} & {\kappa} & {1} & {1} \\ {\kappa} & {*} & {1} & {1} \\ {1} & {1} & {*} & {\kappa} \\ {1} & {1} & {\kappa} & {*} \end{pmatrix}</math>
| |
| | |
| The Kimura two-parameter distance is given by:
| |
| :<math>\hat{d} = - {1\over2}\ln(1-2p-q) - {1\over4} \ln (1-2q)</math>
| |
| where ''p'' is the proportion of sites that show transitional differences and
| |
| ''q'' is the proportion of sites that show transversional differences.
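The Kimura two-parameter distance needs the differences split into transitions and transversions. The sketch below does this for two hypothetical aligned sequences; the function names are illustrative assumptions.

```python
import math

PURINES = {"A", "G"}


def count_differences(seq1, seq2):
    """Return (p, q): the proportions of sites showing transitional and
    transversional differences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = 0
    transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        # Both purines or both pyrimidines: a transition.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    n = len(seq1)
    return transitions / n, transversions / n


def k80_distance(p, q):
    """Kimura two-parameter distance."""
    return -0.5 * math.log(1.0 - 2.0 * p - q) - 0.25 * math.log(1.0 - 2.0 * q)


# Hypothetical aligned sequences: two transitions and one transversion in 20 sites.
seq_a = "AGCTAGCTAGCTAGCTAGCT"
seq_b = "GGTTACCTAGCTAGCTAGCT"

p, q = count_differences(seq_a, seq_b)
d = k80_distance(p, q)
print(p, q, d)
```

When transversional differences are exactly twice as frequent as transitional ones (as expected under JC69), this formula gives the same value as the Jukes-Cantor correction applied to the total proportion of differences.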
| |
| | |
| === F81 model ([[Joe Felsenstein|Felsenstein]] 1981)<ref>{{cite journal |author=Felsenstein J |year=1981 |title=Evolutionary trees from DNA sequences: a maximum likelihood approach |journal=Journal of Molecular Evolution |volume=17 |pages=368–376 |doi=10.1007/BF01734359 |pmid=7288891 |issue=6}}</ref> ===
| |
| | |
| [[Joseph Felsenstein|Felsenstein's]] 1981 model is an extension of the JC69 model in which base frequencies are allowed to differ from 0.25 (<math>\pi_T \ne \pi_C \ne \pi_A \ne \pi_G </math>).
| |
| | |
| Rate matrix:
| |
| | |
| : <math>Q= \begin{pmatrix} {*} & {\pi_C} & {\pi_A} & {\pi_G} \\ {\pi_T} & {*} & {\pi_A} & {\pi_G} \\ {\pi_T} & {\pi_C} & {*} & {\pi_G} \\ {\pi_T} & {\pi_C} & {\pi_A} & {*} \end{pmatrix}</math>
| |
| | |
| When branch length, ν, is measured in the expected number of changes per site then:
| |
| : <math>\beta = 1/(1-\pi_A^2-\pi_C^2-\pi_G^2-\pi_T^2)</math>
| |
| : <math>P_{ij}(\nu) = \left\{
| |
| \begin{array}{cc}
| |
| e^{-\beta\nu}+\pi_j\left(1- e^{-\beta\nu}\right) & \mbox{ if } i = j \\
| |
| \pi_j\left(1- e^{-\beta\nu}\right) & \mbox{ if } i \neq j
| |
| \end{array}
| |
| \right.</math>
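The F81 transition probabilities can be sketched as follows, assuming hypothetical base frequencies. With equal frequencies the scaling factor becomes 4/3 and the values coincide with the JC69 formula, which gives a convenient check. The function names are illustrative.

```python
import math


def f81_beta(freqs):
    """Scaling factor beta = 1 / (1 - sum of squared base frequencies)."""
    return 1.0 / (1.0 - sum(pi * pi for pi in freqs.values()))


def f81_prob(i, j, nu, freqs):
    """P_ij(nu) under F81, with nu in expected substitutions per site."""
    decay = math.exp(-f81_beta(freqs) * nu)
    change = freqs[j] * (1.0 - decay)
    return decay + change if i == j else change


# Hypothetical base frequencies and branch length.
freqs = {"A": 0.35, "C": 0.15, "G": 0.2, "T": 0.3}
nu = 0.2

row_from_A = {j: f81_prob("A", j, nu, freqs) for j in freqs}
print(row_from_A, sum(row_from_A.values()))
```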
| |
| | |
| === HKY85 model (Hasegawa, Kishino and Yano 1985)<ref>{{cite journal |author=Hasegawa M, Kishino H, Yano T |year=1985 |title=Dating of human-ape splitting by a molecular clock of mitochondrial DNA |journal=Journal of Molecular Evolution |volume=22 |pages=160–174 |doi=10.1007/BF02101694 |pmid=3934395 |issue=2}}</ref> ===
| |
| | |
| The HKY85 model can be thought of as combining the extensions made in the Kimura80 and Felsenstein81 models. Namely, it distinguishes between the rate of [[Transition (genetics)|transitions]] and [[transversion]]s (using the κ parameter), and it allows unequal base frequencies (<math>\pi_T \ne \pi_C \ne \pi_A \ne \pi_G </math>). Felsenstein described a similar (but not equivalent) model in 1984 using a different parameterization;<ref>{{cite journal |author=Kishino H, Hasegawa M |year=1989 |title=Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in hominoidea |journal=Journal of Molecular Evolution |volume=29 |issue=2 |pages=170–179 |doi=10.1007/BF02100115 |pmid=2509717}}</ref> that latter model is referred to as the F84 model.<ref>{{cite journal |author=Felsenstein J, Churchill GA |year=1996 |title=A Hidden Markov Model approach to variation among sites in rate of evolution |journal=Molecular Biology and Evolution |volume=13 |issue=1 |pages=93–104 |url=http://mbe.oxfordjournals.org/cgi/content/abstract/13/1/93 |doi=10.1093/oxfordjournals.molbev.a025575 |pmid=8583911}}</ref>
| |
| | |
| Rate matrix <math>Q= \begin{pmatrix} {*} & {\kappa\pi_C} & {\pi_A} & {\pi_G} \\ {\kappa\pi_T} & {*} & {\pi_A} & {\pi_G} \\ {\pi_T} & {\pi_C} & {*} & {\kappa\pi_G} \\ {\pi_T} & {\pi_C} & {\kappa\pi_A} & {*} \end{pmatrix}</math>
| |
| | |
| If we express the branch length, ''ν'' in terms of the expected number of changes per site then:
| |
| :<math>\beta = \frac{1}{2(\pi_A + \pi_G)(\pi_C + \pi_T) + 2\kappa[(\pi_A\pi_G) + (\pi_C\pi_T)]} </math>
| |
| :<math>P_{AA}(\nu,\kappa,\pi) = \left[\pi_A\left(\pi_A + \pi_G + (\pi_C + \pi_T)e^{-\beta\nu}\right) + \pi_G e^{-(1 + (\pi_A + \pi_G)(\kappa - 1.0))\beta\nu}\right]/(\pi_A + \pi_G) </math>
| |
| :<math>P_{AC}(\nu,\kappa,\pi) = \pi_C\left(1.0 - e^{-\beta\nu}\right) </math>
| |
| :<math>P_{AG}(\nu,\kappa,\pi) = \left[\pi_G\left(\pi_A + \pi_G + (\pi_C + \pi_T)e^{-\beta\nu}\right) - \pi_Ge^{-(1 + (\pi_A + \pi_G)(\kappa - 1.0))\beta\nu}\right] /\left(\pi_A + \pi_G\right) </math>
| |
| :<math>P_{AT}(\nu,\kappa,\pi) = \pi_T\left(1.0 - e^{-\beta\nu}\right) </math>
| |
| and formulae for the other combinations of states can be obtained by substituting in the appropriate base frequencies.
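The four formulae for a site starting in state A can be evaluated together. The sketch below assumes hypothetical values for the base frequencies, κ and ν; the function name <code>hky85_from_A</code> is illustrative. The four probabilities should sum to 1, and with κ = 1 and equal frequencies they reduce to the JC69 values.

```python
import math


def hky85_from_A(nu, kappa, pi):
    """Return P_AA, P_AC, P_AG and P_AT under HKY85 as a dict keyed by the
    destination state; nu is in expected substitutions per site."""
    purines = pi["A"] + pi["G"]
    pyrimidines = pi["C"] + pi["T"]
    beta = 1.0 / (2.0 * purines * pyrimidines
                  + 2.0 * kappa * (pi["A"] * pi["G"] + pi["C"] * pi["T"]))
    e1 = math.exp(-beta * nu)
    e2 = math.exp(-(1.0 + purines * (kappa - 1.0)) * beta * nu)
    common = purines + pyrimidines * e1
    return {
        "A": (pi["A"] * common + pi["G"] * e2) / purines,
        "C": pi["C"] * (1.0 - e1),
        "G": (pi["G"] * common - pi["G"] * e2) / purines,
        "T": pi["T"] * (1.0 - e1),
    }


# Hypothetical parameter values.
pi = {"A": 0.3, "C": 0.2, "G": 0.25, "T": 0.25}
probs = hky85_from_A(nu=0.3, kappa=4.0, pi=pi)
print(probs, sum(probs.values()))
```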
| |
| | |
| === T92 model (Tamura 1992)<ref>{{cite journal |author=Tamura K |year=1992 |title=Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C content biases |journal=Molecular Biology and Evolution |volume=9 |pages=678–687 |url=http://mbe.oxfordjournals.org/cgi/content/abstract/9/4/678 |issue=4 |pmid=1630306}}</ref> ===
| |
| | |
| T92 is a simple mathematical method developed to estimate the number of nucleotide substitutions per site between two DNA sequences, by extending Kimura’s (1980) two-parameter method to the case where a G+C-content bias exists. This method will be useful when there are strong transition-transversion and G+C-content biases, as in the case of Drosophila mitochondrial DNA. (Tamura 1992)
| |
| | |
| The model has a single frequency parameter, <math>\pi_{GC} </math>, the G+C content, from which the four base frequencies follow:
| |
| | |
| <math>\pi_G = \pi_C = {\pi_{GC}\over 2} </math>
| |
| | |
| <math>\pi_A = \pi_T = {(1-\pi_{GC})\over 2}</math>
| |
| | |
| Rate matrix <math>Q= \begin{pmatrix} {*} & {\kappa(1-\pi_{GC})/2} & {(1-\pi_{GC})/2} & {(1-\pi_{GC})/2} \\ {\kappa\pi_{GC}/2} & {*} & {\pi_{GC}/2} & {\pi_{GC}/2} \\ {(1-\pi_{GC})/2} & {(1-\pi_{GC})/2} & {*} & {\kappa(1-\pi_{GC})/2} \\ {\pi_{GC}/2} & {\pi_{GC}/2} & {\kappa\pi_{GC}/2} & {*} \end{pmatrix}</math>
| |
| | |
| The evolutionary distance between two noncoding sequences according to this model is given by
| |
| :<math>d = -h \ln(1-{p\over h}-q)-{1\over2}(1-h)\ln(1-2q)</math>
| |
| where <math>h = 2\theta(1-\theta)</math> and <math>\theta \in (0,1)</math> is the GC content.
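A minimal sketch of the T92 distance follows, with hypothetical values for ''p'', ''q'' and θ; the function name is illustrative. When θ = 0.5 the factor ''h'' equals 1/2 and the expression reduces to the Kimura two-parameter distance.

```python
import math


def t92_distance(p, q, theta):
    """Tamura (1992) distance from the proportions of transitional (p) and
    transversional (q) differences and the GC content theta."""
    h = 2.0 * theta * (1.0 - theta)
    return (-h * math.log(1.0 - p / h - q)
            - 0.5 * (1.0 - h) * math.log(1.0 - 2.0 * q))


# Hypothetical observations: 10% transitions, 5% transversions, 40% GC.
d = t92_distance(0.10, 0.05, 0.40)
print(d)
```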
| |
| | |
| === TN93 model (Tamura and [[Masatoshi Nei|Nei]] 1993)<ref>{{cite journal |author=Tamura K, Nei M |year=1993 |title=Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees |journal=Molecular Biology and Evolution |volume=10 |pages=512–526 |url=http://mbe.oxfordjournals.org/cgi/content/abstract/10/3/512 |issue=3 |pmid=8336541}}</ref> ===
| |
| | |
| The TN93 model distinguishes between the two different types of [[Transition (genetics)|transition]] - i.e. (A <-> G) is allowed to have a different rate to (C<->T). [[Transversion]]s are all assumed to occur at the same rate, but that rate is allowed to be different from both of the rates for transitions.
| |
| | |
| TN93 also allows unequal base frequencies (<math>\pi_T \ne \pi_C \ne \pi_A \ne \pi_G </math>).
| |
| | |
| Rate matrix <math>Q= \begin{pmatrix} {*} & {\kappa_1\pi_C} & {\pi_A} & {\pi_G} \\ {\kappa_1\pi_T} & {*} & {\pi_A} & {\pi_G} \\ {\pi_T} & {\pi_C} & {*} & {\kappa_2\pi_G} \\ {\pi_T} & {\pi_C} & {\kappa_2\pi_A} & {*} \end{pmatrix}</math>
| |
| | |
| === GTR: Generalised time-reversible ([[Simon Tavaré|Tavaré]] 1986)<ref name="Tavare1986">{{cite journal |author=[[Simon Tavaré|Tavaré S]] |title=Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences |publisher=American Mathematical Society |journal=Lectures on Mathematics in the Life Sciences |volume=17 |pages=57–86 |year=1986 |url=http://www.cmb.usc.edu/people/stavare/STpapers-pdf/T86.pdf}}</ref> ===
| |
| GTR is the most general neutral, independent, finite-sites, time-reversible model possible. It was first described in a general form by [[Simon Tavaré]] in 1986.<ref name="Tavare1986"/>
| |
| | |
| The GTR parameters consist of an equilibrium base frequency vector, <math>\Pi = (\pi_1 , \pi_2 , \pi_3 , \pi_4)</math>, giving the frequency at which each base occurs at each site, and the rate matrix
| |
| | |
| : <math>Q = \begin{pmatrix}
| |
| {-(x_1 + x_2 + x_3)} & {\pi_1 x_1 \over \pi_2} & {\pi_1 x_2 \over \pi_3} & {\pi_1 x_3 \over \pi_4} \\
| |
| {x_1} & {-({\pi_1 x_1 \over \pi_2} + x_4 + x_5)} & {\pi_2 x_4 \over \pi_3} & {\pi_2 x_5 \over \pi_4} \\
| |
| {x_2} & {x_4} & {-({\pi_1 x_2 \over \pi_3} + {\pi_2 x_4 \over \pi_3} + x_6)} & {\pi_3 x_6 \over \pi_4} \\
| |
| {x_3} & {x_5} & {x_6} & {-({\pi_1 x_3 \over \pi_4} + {\pi_2 x_5 \over \pi_4} + {\pi_3 x_6 \over \pi_4})}
| |
| \end{pmatrix} </math>
| |
| | |
| Therefore, GTR (for four characters, as is often the case in phylogenetics) requires 6 substitution rate parameters, as well as 4 equilibrium base frequency parameters. However, this is usually reduced to 9 parameters plus <math>\mu</math>, the overall number of substitutions per unit time. When measuring time in substitutions (<math>\mu</math>=1) only 8 free parameters remain.
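The GTR rate matrix above can be assembled from the six rate parameters and four frequencies and then checked numerically. The sketch below uses hypothetical parameter values and an illustrative function name; it verifies that each column sums to zero and that the frequency vector is stationary (<math>Q\Pi = 0</math>).

```python
def gtr_rate_matrix(x, pi):
    """Build the GTR rate matrix from x = (x1, ..., x6) and
    pi = (pi1, ..., pi4), following the layout shown above."""
    x1, x2, x3, x4, x5, x6 = x
    p1, p2, p3, p4 = pi
    q = [
        [0.0, p1 * x1 / p2, p1 * x2 / p3, p1 * x3 / p4],
        [x1, 0.0, p2 * x4 / p3, p2 * x5 / p4],
        [x2, x4, 0.0, p3 * x6 / p4],
        [x3, x5, x6, 0.0],
    ]
    # Each diagonal entry is minus the sum of the other entries in its column.
    for j in range(4):
        q[j][j] = -sum(q[i][j] for i in range(4) if i != j)
    return q


# Hypothetical parameter values.
x = (0.9, 0.3, 0.2, 0.4, 1.1, 0.5)
pi = (0.3, 0.2, 0.2, 0.3)
Q = gtr_rate_matrix(x, pi)

column_sums = [sum(Q[i][j] for i in range(4)) for j in range(4)]
q_times_pi = [sum(Q[i][j] * pi[j] for j in range(4)) for i in range(4)]
print(column_sums, q_times_pi)
```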
| |
| | |
| In general, to compute the number of parameters, one must count the number of entries above the diagonal in the matrix, i.e. for n trait values per site <math>{{n^2-n} \over 2} </math>, and then add ''n'' for the equilibrium base frequencies, and subtract 1 because <math>\mu</math> is fixed. One gets
| |
| | |
| : <math>{{n^2-n} \over 2} + n - 1 = {1 \over 2}n^2 + {1 \over 2}n - 1.</math>
| |
| | |
| For example, for an amino acid sequence (there are 20 "standard" amino acids that make up [[proteins]]), one would find there are 209 parameters. However, when studying coding regions of the genome, it is more common to work with a [[codon]] substitution model (a codon is three bases and codes for one amino acid in a protein). There are <math>4^3 = 64</math> codons, but the rates for transitions between codons which differ by more than one base are assumed to be zero. Hence, there are <math>{{20 \times 19 \times 3} \over 2} + 64 - 1 = 633</math> parameters.
| |
| | |
| ==See also==
| |
| *[[Molecular evolution]]
| |
| *[[Molecular clock]]
| |
| *[[UPGMA]]
| |
| | |
| ==References==
| |
| {{Reflist|2}}
| |
| | |
| ===Further reading===
| |
| {{more footnotes|date=November 2010}}
| |
| * {{cite journal |author=Gu X, Li W |year=1992 |title=Higher rates of amino acid substitution in rodents than in man |journal=Molecular Phylogenetics and Evolution |volume=1 |pages=211–214 |doi=10.1016/1055-7903(92)90017-B |pmid=1342937 |issue=3}}
| |
| * {{cite journal |author=Li W-H, Ellsworth DL, Krushkal J, Chang BH-J, Hewett-Emmett D |year=1996 |title=Rates of nucleotide substitution in primates and rodents and the generation-time effect hypothesis |journal=Molecular Phylogenetics and Evolution |volume=5 |pages=182–187 |doi=10.1006/mpev.1996.0012 |pmid=8673286 |issue=1}}
| |
| | |
| ==External links==
| |
| *[http://scit.us/projects/dawg DAWG: DNA Assembly With Gaps] — free software for simulating sequence evolution
| |
| | |
| {{MolecularEvolution}}
| |
| {{Evolution}}
| |
| | |
| [[Category:Bioinformatics]]
| |
| [[Category:Phylogenetics]]
| |
| [[Category:Computational phylogenetics]]
| |
| [[Category:Markov models]]
| |