| The '''Akaike information criterion''' ('''AIC''') is a measure of the relative quality of a [[statistical model]], for a given set of data. As such, AIC provides a means for [[model selection]].
| |
|
| |
|
| AIC deals with the trade-off between the [[goodness of fit]] of the model and the complexity of the model. It is founded on [[information entropy]]: it offers a relative estimate of the [[Kullback-Leibler divergence|information lost]] when a given model is used to represent the process that generates the data.
| |
|
| |
|
| AIC does not provide a test of a model in the sense of testing a [[null hypothesis]]; that is, AIC says nothing about the quality of a model in an absolute sense. If all the candidate models fit poorly, AIC will give no warning of that.
| | |
| ==Definition==
| |
| In the general case, the AIC is
| |
| | |
| :<math>\mathit{AIC} = 2k - 2\ln(L)</math>
| |
| | |
| where ''k'' is the number of [[parameter]]s in the [[statistical model]], and ''L'' is the maximized value of the [[likelihood function]] for the estimated model.
| |
| | |
| Given a set of candidate models for the data, ''the preferred model is the one with the minimum AIC value.'' AIC thus rewards goodness of fit, but it also includes a penalty that is an increasing function of the number of estimated parameters. This penalty discourages [[overfitting]]: increasing the number of free parameters in the model improves the goodness of fit, regardless of the number of free parameters in the data-generating process.
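As a minimal sketch of the definition above, the following Python snippet computes AIC for two candidate models and picks the one with the smaller value. The model names and the maximized log-likelihood values are hypothetical, chosen only for illustration; in practice they would come from fitting each model to the data.

```python
def aic(k, log_likelihood):
    """Return AIC = 2k - 2 ln(L), where log_likelihood is ln(L) at its maximum."""
    return 2 * k - 2 * log_likelihood


# Hypothetical candidates: k is the number of estimated parameters,
# log_likelihood is the maximized log-likelihood (assumed values).
candidates = {
    "model_a": {"k": 2, "log_likelihood": -50.0},
    "model_b": {"k": 5, "log_likelihood": -48.5},
}

aic_values = {
    name: aic(m["k"], m["log_likelihood"]) for name, m in candidates.items()
}

# The preferred model is the one with the minimum AIC value.
preferred = min(aic_values, key=aic_values.get)

print(aic_values)
print("preferred:", preferred)
```

In this sketch the second model has the higher log-likelihood, but the penalty for its three extra parameters outweighs the improvement in fit, so the first model has the lower AIC.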
| |
| | |
| AIC is founded in information theory. Suppose that the data is generated by some unknown process ''f''. We consider two candidate models to represent ''f'': ''g''<sub>1</sub> and ''g''<sub>2</sub>. If we knew ''f'', then we could find the information lost from using ''g''<sub>1</sub> to represent ''f'' by calculating the [[Kullback–Leibler divergence]], ''D''<sub>KL</sub>(''f'' ‖ ''g''<sub>1</sub>); similarly, the information lost from using ''g''<sub>2</sub> to represent ''f'' would be found by calculating ''D''<sub>KL</sub>(''f'' ‖ ''g''<sub>2</sub>). We would then choose the candidate model that minimized the information loss.
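The comparison described above can be illustrated for discrete distributions, for which the Kullback–Leibler divergence is the sum of ''f''<sub>''i''</sub> ln(''f''<sub>''i''</sub>/''g''<sub>''i''</sub>) over all outcomes. The three distributions below are invented for illustration only; in a real problem ''f'' is unknown, so this calculation cannot be carried out directly.

```python
import math


def kl_divergence(f, g):
    """D_KL(f || g) for discrete distributions given as equal-length sequences.

    Terms with f_i == 0 contribute nothing; g_i is assumed positive
    wherever f_i is positive.
    """
    return sum(p * math.log(p / q) for p, q in zip(f, g) if p > 0)


# Illustrative (assumed) distributions over three outcomes.
f = [0.5, 0.3, 0.2]   # the data-generating process, normally unknown
g1 = [0.4, 0.4, 0.2]  # first candidate model
g2 = [0.8, 0.1, 0.1]  # second candidate model

loss_g1 = kl_divergence(f, g1)
loss_g2 = kl_divergence(f, g2)

# Choose the candidate that loses less information about f.
best = "g1" if loss_g1 < loss_g2 else "g2"
print(loss_g1, loss_g2, best)
```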
| |
| | |
| We cannot choose with certainty, because we do not know ''f''. Akaike (1974) showed, however, that we can estimate, via AIC, how much more (or less) information is lost by ''g''<sub>1</sub> than by ''g''<sub>2</sub>. Remarkably, the resulting formula for AIC is very simple. The estimate, though, is only valid [[Asymptotic analysis|asymptotically]]; if the number of data points is small, then some correction is often necessary (see AICc, below).
| |
| | |
| ==How to apply AIC in practice==
| |
| To apply AIC in practice, we start with a set of candidate models, and then find the models' corresponding AIC values. There will almost always be information lost due to using one of the candidate models to represent the "true" model (i.e. the process that generates the data). We wish to select, from among ''R'' candidate models, the model that minimizes the information loss. We cannot choose with certainty, but we can minimize the estimated information loss.
| |
| | |
| Denote the AIC values of the candidate models by AIC<sub>1</sub>, AIC<sub>2</sub>, AIC<sub>3</sub>, …, AIC<sub>''R''</sub>. Let AIC<sub>min</sub> be the minimum of those values. Then exp((AIC<sub>min</sub>−AIC<sub>''i''</sub>)/2) can be interpreted as the relative probability that the ''i''th model minimizes the (estimated) information loss.<ref>{{Harvnb|Burnham|Anderson|2002|loc=§6.4.5}}</ref>
| |
| | |
| As an example, suppose that there were three models in the candidate set, with AIC values 100, 102, and 110. Then the second model is exp((100−102)/2) = 0.368 times as probable as the first model to minimize the information loss, and the third model is exp((100−110)/2) = 0.007 times as probable as the first model to minimize the information loss.
| |
| | |
| In this example, we would omit the third model from further consideration. We then have three options: (1) we could decide to gather more data, in the hope that this will allow clearly distinguishing between the first two models; (2) we could simply conclude that the data is insufficient to support selecting one model from among the first two; (3) we could take a weighted average of the first two models, with weights 1 and 0.368, respectively, and then do [[statistical inference]] based on the weighted multimodel.<ref>{{Harvnb|Burnham|Anderson|2002}}</ref>
| |
| | |
| The quantity exp((AIC<sub>min</sub>−AIC<sub>''i''</sub>)/2) is the ''[[Likelihood function#Relative_likelihood|relative likelihood]]'' of model ''i''.
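The three-model example above can be reproduced with a short Python sketch. The function names are hypothetical, and the normalization step (rescaling the relative likelihoods so that they sum to 1) is an assumption added here for convenience; it is equivalent to weighting the first two models by 1 and 0.368.

```python
import math


def relative_likelihoods(aic_values):
    """Return exp((AIC_min - AIC_i) / 2) for each candidate model."""
    aic_min = min(aic_values)
    return [math.exp((aic_min - a) / 2) for a in aic_values]


def normalize(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


# AIC values from the worked example in the text.
rel = relative_likelihoods([100, 102, 110])

# Weights for averaging the first two models after dropping the third.
weights_first_two = normalize(rel[:2])

print([round(r, 3) for r in rel])
print(weights_first_two)
```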
| |
| | |
| If all the models in the candidate set have the same number of parameters, then using AIC might at first appear to be very similar to using the [[likelihood-ratio test]]. There are, however, important distinctions. In particular, the likelihood-ratio test is valid only for nested models whereas AIC (and AICc) has no such restriction.<ref>{{Harvnb|Burnham|Anderson|2002|loc=§2.12.4}}</ref>
| |
| | |
| ==AICc==
| |
| AICc is AIC with a correction for finite sample sizes:
| |
| | |
| :<math>AICc = AIC + \frac{2k(k + 1)}{n - k - 1}</math>
| |
| | |
| where ''n'' denotes the sample size. Thus, AICc is AIC with a greater penalty for extra parameters.
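A minimal sketch of the correction, assuming the AIC value has already been computed. The function name and the guard against ''n'' ≤ ''k'' + 1 (where the denominator is zero or negative) are choices made for this example, not part of the formula itself.

```python
def aicc(aic_value, k, n):
    """Return AICc = AIC + 2k(k + 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        # The correction term is undefined or negative in this range.
        raise ValueError("AICc requires n > k + 1")
    return aic_value + (2 * k * (k + 1)) / (n - k - 1)


# With a small sample the correction is noticeable ...
small_sample = aicc(100.0, k=3, n=20)
# ... and it shrinks towards zero as n grows.
large_sample = aicc(100.0, k=3, n=2000)

print(small_sample, large_sample)
```

With ''k'' = 3, the correction is 24/16 = 1.5 for ''n'' = 20 but only about 0.012 for ''n'' = 2000.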
| |
| | |
| {{Harvtxt|Burnham|Anderson|2002}} strongly recommend using AICc, rather than AIC, if ''n'' is small or ''k'' is large. Since AICc converges to AIC as ''n'' gets large, AICc generally should be employed regardless.<ref>{{Harvnb|Burnham|Anderson|2004}}</ref> Using AIC, instead of AICc, when ''n'' is not many times larger than ''k''<sup>2</sup>, increases the probability of selecting models that have too many parameters, i.e. of overfitting. The probability of AIC overfitting can be substantial, in some cases.<ref>{{Harvnb|Claeskens|Hjort|2008|loc=§8.3}}</ref>
| |
| | |
| {{Harvtxt|Brockwell|Davis|1991|p=273}} advise using AICc as the primary criterion in selecting the orders of an [[Autoregressive moving average model|ARMA]] model for time series. {{Harvtxt|McQuarrie|Tsai|1998}} ground their high opinion of AICc on extensive simulation work with regression and time series.
| |
| | |
| AICc was first proposed by {{Harvtxt|Hurvich|Tsai|1989}}. Different derivations of it are given by {{Harvtxt|Brockwell|Davis|1991}}, Burnham & Anderson, and {{Harvtxt|Cavanaugh|1997}}. All the derivations assume a univariate linear model with normally distributed errors (conditional upon regressors); if that assumption does not hold, then the formula for AICc will usually change. Further discussion of this, with examples of other assumptions, is given by {{Harvtxt|Burnham|Anderson|2002|loc=ch. 7}}. In particular, [[Bootstrapping (statistics)|bootstrap]] estimation is usually feasible.
| |
| | |
| Note that when all the models in the candidate set have the same ''k'', then AICc and AIC will give identical (relative) valuations. In that situation, then, AIC can always be used.
| |
| | |
| ==Relevance to chi-squared fitting==
| |
| Often, one wishes to select amongst competing models where the likelihood functions assume that the underlying errors are normally distributed (with mean zero) and independent. This assumption leads to [[Chi-squared distribution|<math>\chi^2</math>]] model fitting.
| |
| | |
| For <math>\chi^2</math> fitting, the likelihood is given by
| |
| :<math>L=\prod_{i=1}^n \left(\frac{1}{2 \pi \sigma_i^2}\right)^{1/2} \exp \left( -\frac{(y_i-f(x_i))^2}{2\sigma_i^2}\right)</math>
| |
| :<math>\therefore \ln(L) = \ln\left(\prod_{i=1}^n\left(\frac{1}{2\pi\sigma_i^2}\right)^{1/2}\right) - \frac{1}{2}\sum_{i=1}^n \frac{(y_i-f(x_i))^2}{\sigma_i^2}</math>
| |
| :<math>\therefore \ln(L) = C - \chi^2/2 \,</math>,
| |
| where <math>\chi^2 = \sum_{i=1}^n (y_i-f(x_i))^2/\sigma_i^2</math> and ''C'' is a constant that is independent of the model used and depends only on the particular data points; i.e., it does not change if the data do not change.
| |
| | |
| The AIC is therefore given by <math>AIC = 2k - 2\ln(L) = 2k - 2(C-\chi^2/2) = 2k -2C + \chi^2 \,</math>. As only differences in AIC are meaningful, the constant ''C'' can be ignored, allowing us to take <math>AIC = \chi^2 + 2k</math> for model comparisons.
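The simplified form can be sketched in Python as follows. The data set, the common uncertainty of 0.5, and the two candidate models (a constant and a straight line) are invented for illustration; the predictions listed are the least-squares fits for this particular data set, entered by hand rather than computed.

```python
def chi_squared(y, predicted, sigma):
    """Sum of squared residuals, each scaled by its standard deviation."""
    return sum(((yi - fi) / si) ** 2 for yi, fi, si in zip(y, predicted, sigma))


def aic_from_chi2(chi2, k):
    """AIC = chi^2 + 2k, with the model-independent constant dropped."""
    return chi2 + 2 * k


# Illustrative (assumed) measurements at x = 1, 2, 3, 4.
y = [0.5, 3.5, 1.5, 4.5]
sigma = [0.5, 0.5, 0.5, 0.5]

# Least-squares predictions for this data set:
constant_fit = [2.5, 2.5, 2.5, 2.5]  # y = 2.5, one parameter
linear_fit = [1.0, 2.0, 3.0, 4.0]    # y = x, two parameters (slope, intercept)

chi2_constant = chi_squared(y, constant_fit, sigma)
chi2_linear = chi_squared(y, linear_fit, sigma)

aic_constant = aic_from_chi2(chi2_constant, k=1)
aic_linear = aic_from_chi2(chi2_linear, k=2)

print(aic_constant, aic_linear)
```

Only the difference between the two values is meaningful, since the constant ''C'' has been dropped from both.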
| |
| | |
| Another convenient form arises if the σ<sub>''i''</sub> are assumed to be identical and the [[residual sum of squares]] (RSS) is available. Then we get AIC = ''n'' ln(RSS/''n'') + 2''k'' + ''C'', where again ''C'' can be ignored in model comparisons.<ref>{{Harvnb|Burnham|Anderson|2002|pp=12, 63}}</ref>
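A short sketch of the RSS form, again dropping the constant ''C''. The function name and the input numbers are hypothetical. Whether ''k'' counts the error variance as a parameter is a matter of convention; it does not affect the comparison as long as the same convention is used for every model.

```python
import math


def aic_from_rss(rss, n, k):
    """AIC = n ln(RSS / n) + 2k, up to a constant shared by all models."""
    return n * math.log(rss / n) + 2 * k


# Assumed residual sums of squares for two models fitted to the same
# n = 4 data points.
n = 4
aic_constant = aic_from_rss(rss=10.0, n=n, k=1)
aic_linear = aic_from_rss(rss=5.0, n=n, k=2)

print(aic_constant, aic_linear)
```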
| |
| | |
| ==History==
| |
| The Akaike information criterion was developed by [[Hirotugu Akaike]], under the name of "an information criterion". It was first published by Akaike in 1974.<ref name="Akaiki1974">{{Harvnb|Akaike|1974}}</ref>
| |
| | |
| The original derivation of AIC relied upon some strong assumptions. {{Harvtxt|Takeuchi|1976}} showed that the assumptions could be made much weaker. This work, however, was in Japanese, and was not widely known outside Japan for many years.
| |
| | |
| AICc was originally proposed for linear regression (only) by {{Harvtxt|Sugiura|1978}}. That instigated the work of {{Harvtxt|Hurvich|Tsai|1989}}, and several further papers by the same authors, which extended the situations in which AICc could be applied. The work of Hurvich & Tsai contributed to the decision to publish a second edition of the volume by {{Harvtxt|Brockwell|Davis|1991}}, which is the standard reference for linear time series; the new edition states, "our prime criterion for model selection [among ARMA(''p'',''q'') models] will be the AICc".<ref>{{Harvnb|Brockwell|Davis|1991|p=273}}</ref>
| |
| | |
| The volume by {{Harvtxt|Burnham|Anderson|2002}} was the first attempt to set out the information-theoretic approach in a general context. It includes an English exposition of the results of Takeuchi. The volume led to far greater use of the information-theoretic approach, and now has over 20000 citations on [[Google Scholar]].
| |
| | |
| Akaike originally called his approach an “entropy maximization principle”. {{Harvtxt|Burnham|Anderson|2002|loc=ch. 2}} discuss and expand on this, and trace the approach back to the work of [[Ludwig Boltzmann]] on thermodynamics. Briefly, minimizing AIC in a statistical model is essentially equivalent to maximizing entropy in a thermodynamic system. In other words, the information-theoretic approach in statistics is essentially applying the [[Second Law of Thermodynamics]].
| |
| | |
| ==Comparison with BIC==
| |
| | |
| The AIC penalizes the number of parameters less strongly than does the [[Bayesian information criterion]] (BIC). A comparison of AIC/AICc and BIC is given by {{Harvtxt|Burnham|Anderson|2002|loc=§6.4}}. The authors show that AIC and AICc can be derived in the same Bayesian framework as BIC, just by using a different prior. The authors also argue that AIC/AICc has theoretical advantages over BIC. First, AIC/AICc is derived from principles of information, whereas BIC is not, despite its name. Second, the (Bayesian-framework) derivation of BIC has a prior of 1/''R'' (where ''R'' is the number of candidate models), which is "not sensible", since the prior should be a decreasing function of ''k''. Additionally, they present a few simulation studies that suggest AICc tends to have practical/performance advantages over BIC. See also {{Harvtxt|Burnham|Anderson|2004}}.
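The difference in penalties mentioned at the start of this section can be made concrete. The sketch below assumes the usual definition BIC = ''k'' ln(''n'') − 2 ln(''L''), which is not stated in this article, so its penalty term is ''k'' ln(''n'') compared with 2''k'' for AIC. The function names are hypothetical.

```python
import math


def aic_penalty(k):
    """Penalty term of AIC: 2k."""
    return 2 * k


def bic_penalty(k, n):
    """Penalty term of BIC under the assumed definition: k ln(n)."""
    return k * math.log(n)


k = 3
for n in (5, 8, 100, 10000):
    # The BIC penalty exceeds the AIC penalty once ln(n) > 2, i.e. n >= 8.
    print(n, aic_penalty(k), round(bic_penalty(k, n), 3))
```

Under this assumption, BIC penalizes parameters more heavily than AIC for any sample with at least 8 observations, and the gap grows with ''n''.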
| |
| | |
| Further comparison of AIC and BIC, in the context of regression, is given by {{Harvtxt|Yang|2005}}. In particular, AIC is asymptotically optimal in selecting the model with the least [[mean squared error]], under the assumption that the exact "true" model is not in the candidate set (as is virtually always the case in practice); BIC is not asymptotically optimal under that assumption. Yang further shows that the rate at which AIC converges to the optimum is, in a certain sense, the best possible.
| |
| | |
| ==See also==
| |
| *[[Deviance information criterion]]
| |
| *[[Focused information criterion]]
| |
| *[[Hannan–Quinn information criterion]]
| |
| *[[Occam's razor]]
| |
| | |
| ==Notes==
| |
| {{reflist|30em}}
| |
| | |
| == References ==
| |
| <references/>
| |
| *{{Citation |first=Hirotugu |last=Akaike |authorlink=Hirotugu Akaike |year=1974 |title=A new look at the statistical model identification |journal=IEEE Transactions on Automatic Control |volume=19 |issue=6 |pages=716–723 |doi=10.1109/TAC.1974.1100705 |mr=0423716 }}.
| |
| *{{Citation |last=Akaike |first=Hirotugu |authorlink=Hirotugu Akaike |year=1980 |chapter=Likelihood and the Bayes procedure |title=Bayesian Statistics |editor-first=J. M. |editor-last=Bernardo |editorlink=José-Miguel Bernardo |editor2=et al. |location=Valencia |publisher=University Press |pages=143–166 |isbn= }}.
| |
| *{{Citation |last=Anderson |first=D. R. |year=2008 |title=Model Based Inference in the Life Sciences |publisher=Springer |isbn= }}.
| |
| *{{Citation |last=Brockwell |first=Peter J. |last2=Davis |first2=Richard A. |year=1987 |title=Time Series: Theory and Methods |edition= |publisher=Springer |isbn=0387964061 }}.
| |
| *{{Citation |last=Brockwell |first=Peter J. |last2=Davis |first2=Richard A. |year=1991 |title=Time Series: Theory and Methods |edition=2nd |publisher=Springer |isbn=0387974296 }}. Republished in 2009: ISBN 1441903194
| |
| *{{Citation |last=Burnham |first=K. P. |last2=Anderson |first2=D. R. |year=2002 |title=Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach |edition=2nd |publisher=Springer-Verlag |isbn=0-387-95364-7 }}.
| |
| *{{Citation |last=Burnham |first=K. P. |last2=Anderson |first2=D. R. |year=2004 |url=http://www.sortie-nd.org/lme/Statistical%20Papers/Burnham_and_Anderson_2004_Multimodel_Inference.pdf |title=Multimodel inference: understanding AIC and BIC in Model Selection |journal=Sociological Methods and Research |issn= |publisher= |volume=33 |issue= |pages=261–304 |doi= }}.
| |
| *{{Citation |last=Cavanaugh |first=J. E. |year=1997 |title=Unifying the derivations of the Akaike and corrected Akaike information criteria |journal=Statistics and Probability Letters |volume=31 |issue= |pages=201–208 |doi= }}.
| |
| *{{Citation |last=Claeskens |first=G. |first2=N. L. |last2=Hjort |year=2008 |title=Model Selection and Model Averaging |publisher=Cambridge |isbn= }}.
| |
| *Fang, Yixin (2011). "[http://www.jds-online.com/file_download/278/JDS-652a.pdf Asymptotic equivalence between cross-validations and Akaike Information Criteria in mixed-effects models]", ''Journal of Data Science'', 9:15–21.
| |
| *{{Citation |last=Hurvich |first=C. M. |last2=Tsai |first2=C.-L. |year=1989 |title=Regression and time series model selection in small samples |journal=[[Biometrika]] |issn= |volume=76 |issue= |pages=297–307 |doi= }}.
| |
| *Lukacs, P.M., et al. (2007). "Concerns regarding a call for pluralism of information theory and hypothesis testing", ''Journal of Applied Ecology'', 44:456–460. {{doi|10.1111/j.1365-2664.2006.01267.x}}.
| |
| *{{Citation |last=McQuarrie |first=A. D. R. |last2=Tsai |first2=C.-L. |year=1998 |title=Regression and Time Series Model Selection |publisher=World Scientific |isbn=981-02-3242-X }}.
| |
| *{{Citation |last=Sugiura |first=N. |year=1978 |title=Further analysis of the data by Akaike’s information criterion and the finite corrections |journal=[[Communications in Statistics]] - Theory and Methods |volume=A7 |issue= |pages=13–26 |doi= }}.
| |
| *{{Citation |last=Takeuchi |first=K. |year=1976 |language=Japanese | title= |trans_title="Distribution of informational statistics and a criterion of model fitting" |journal=Suri-Kagaku [Mathematical Sciences] |volume=153 |issue= |pages=12–18 |doi= }}.
| |
| *{{Citation |last=Yang |first=Y. |year=2005 |title=Can the strengths of AIC and BIC be shared? |journal=Biometrika |issn= |volume=92 |issue= |pages=937–950 |doi= }}.
| |
| | |
| == External links ==
| |
| * [http://www.garfield.library.upenn.edu/classics1981/A1981MS54100001.pdf Hirotugu Akaike comments on how he arrived at the AIC], in ''This Week's Citation Classic'' (21 December 1981)
| |
| * [https://noppa.tkk.fi/noppa/kurssi/s-114.1310/luennot/extramaterial.pdf AIC] ([[Aalto University]])
| |
| * [http://www4.ncsu.edu/~shu3/Presentation/AIC.pdf Akaike Information Criterion] ([[North Carolina State University]])
| |
| * [http://www-nrd.nhtsa.dot.gov/pdf/esv/esv20/07-0476-W.pdf Example AIC use] (Honda USA, Noesis Solutions, Belgium)
| |
| * [http://myweb.uiowa.edu/cavaaugh/ms_seminar.html Model Selection] ([[University of Iowa]])
| |
| | |
| {{DEFAULTSORT:Akaike Information Criterion}}
| |
| [[Category:Regression variable selection]]
| |
| [[Category:Model selection]]
| |