{{Regression bar}}
In [[statistics]], '''Bayesian multivariate linear regression''' is a
[[Bayesian]] approach to [[multivariate linear regression]], i.e. [[linear regression]] where the predicted outcome is a vector of correlated [[random variable]]s rather than a single scalar random variable. A more general treatment of this approach can be found in the article [[MMSE estimator]].
 
==Details==
 
Consider a regression problem where the [[dependent variable]] to be
predicted is not a single [[real-valued]] scalar but an ''m''-length vector
of correlated real numbers.  As in the standard regression setup, there
are ''n'' observations, where each observation ''i'' consists of ''k''-1
[[explanatory variable]]s, grouped into a vector <math>\mathbf{x}_i</math>
of length ''k'' (where a [[dummy variable (statistics)|dummy variable]] with a value of 1 has been
added to allow for an intercept coefficient). This can be viewed as a
set of ''m'' related regression problems for each observation ''i'':
 
:<math>y_{i,1} = \mathbf{x}_i^{\rm T}\boldsymbol\beta_{1} + \epsilon_{i,1}</math>
:<math>\vdots</math>
:<math>y_{i,m} = \mathbf{x}_i^{\rm T}\boldsymbol\beta_{m} + \epsilon_{i,m}</math>
 
where the errors <math>\{\epsilon_{i,1},\ldots,\epsilon_{i,m}\}</math>
are all correlated.  Equivalently, it can be viewed as a single regression
problem where the outcome is a [[row vector]] <math>\mathbf{y}_i^{\rm T}</math>
and the regression coefficient vectors are stacked next to each other, as follows:
 
:<math>\mathbf{y}_i^{\rm T} = \mathbf{x}_i^{\rm T}\mathbf{B} + \boldsymbol\epsilon_{i}^{\rm T}.</math>
 
The coefficient matrix '''B''' is a <math>k \times m</math> matrix where the coefficient vectors <math>\boldsymbol\beta_1,\ldots,\boldsymbol\beta_m</math> for each regression problem are stacked horizontally:
 
:<math>\mathbf{B} =
\begin{bmatrix}
\begin{pmatrix} \\ \boldsymbol\beta_1 \\ \\ \end{pmatrix}
\cdots
\begin{pmatrix} \\ \boldsymbol\beta_m \\ \\ \end{pmatrix}
\end{bmatrix}
=
\begin{bmatrix}
\begin{pmatrix}
\beta_{1,1} \\ \vdots \\ \beta_{1,k} \\
\end{pmatrix}
\cdots
\begin{pmatrix}
\beta_{m,1} \\ \vdots \\ \beta_{m,k} \\
\end{pmatrix}
\end{bmatrix}
.
</math>
 
The noise vector <math>\boldsymbol\epsilon_{i}</math> for each observation ''i''
is jointly normal, so that the outcomes for a given observation are
correlated:
 
:<math>\boldsymbol\epsilon_i \sim N(0, \boldsymbol\Sigma_{\epsilon}),</math>

where <math>\boldsymbol\Sigma_{\epsilon}</math> is an <math>m \times m</math> [[covariance matrix]].
 
We can write the entire regression problem in matrix form as:
 
:<math>\mathbf{Y} =\mathbf{X}\mathbf{B} + \mathbf{E},</math>
 
where '''Y''' and '''E''' are <math>n \times m</math> matrices.  The [[design matrix]] '''X''' is an <math>n \times k</math> matrix with the observations stacked vertically, as in the standard [[linear regression]] setup:
 
:<math>
\mathbf{X} = \begin{bmatrix} \mathbf{x}^{\rm T}_1 \\ \mathbf{x}^{\rm T}_2 \\ \vdots \\ \mathbf{x}^{\rm T}_n \end{bmatrix}
= \begin{bmatrix} x_{1,1} & \cdots & x_{1,k} \\
x_{2,1} & \cdots & x_{2,k} \\
\vdots & \ddots & \vdots \\
x_{n,1} & \cdots & x_{n,k}
\end{bmatrix}.
</math>
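
To make the shapes concrete, the following is a minimal NumPy sketch of this setup; the dimensions and values (<code>n</code>, <code>k</code>, <code>m</code>, the covariance <code>Sigma_eps</code>) are arbitrary choices for illustration, not part of the model itself.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 50, 3, 2   # observations, predictors (including intercept), outcomes

# Design matrix X (n x k): a leading column of ones plays the role of the
# dummy variable that carries the intercept coefficient.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

# Coefficient matrix B (k x m): one column of coefficients per outcome.
B_true = rng.normal(size=(k, m))

# Error rows are jointly normal with a non-diagonal covariance, so the m
# outcomes within each observation are correlated.
Sigma_eps = np.array([[1.0, 0.6],
                      [0.6, 2.0]])
E = rng.multivariate_normal(np.zeros(m), Sigma_eps, size=n)   # (n x m)

# The stacked regression model Y = X B + E, with Y an (n x m) matrix.
Y = X @ B_true + E
</syntaxhighlight>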
 
The classical, frequentist [[linear least squares (mathematics)|linear least squares]] solution is simply to estimate the matrix of regression coefficients <math>\hat{\mathbf{B}}</math> using the [[Moore–Penrose pseudoinverse]]:
 
:<math> \hat{\mathbf{B}} = (\mathbf{X}^{\rm T}\mathbf{X})^{-1}\mathbf{X}^{\rm T}\mathbf{Y}</math>.
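
Continuing the sketch above, this estimate can be computed for all <math>m</math> outcomes at once; <code>numpy.linalg.lstsq</code> evaluates the same formula in a numerically stabler way than forming the inverse explicitly.

<syntaxhighlight lang="python">
# Least-squares estimate B_hat = (X^T X)^{-1} X^T Y for all m outcomes at once.
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)            # (k x m)

# Direct (less numerically stable) evaluation of the formula in the text.
B_hat_direct = np.linalg.solve(X.T @ X, X.T @ Y)
assert np.allclose(B_hat, B_hat_direct)
</syntaxhighlight>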
 
To obtain the Bayesian solution, we need to specify the conditional likelihood and then find the appropriate conjugate prior.  As with the univariate case of [[Bayesian linear regression|linear Bayesian regression]], we will find that we can specify a natural conditional conjugate prior (which is scale dependent).
 
Let us write our conditional likelihood as
 
:<math>\rho(\mathbf{E}|\boldsymbol\Sigma_{\epsilon}) \propto |\boldsymbol\Sigma_{\epsilon}|^{-n/2} \exp(-\frac{1}{2} {\rm tr}(\mathbf{E} \boldsymbol\Sigma_{\epsilon}^{-1}\mathbf{E}^{\rm T}) ) .</math>

Writing the error <math>\mathbf{E}</math> in terms of <math>\mathbf{Y},\mathbf{X},</math> and <math>\mathbf{B}</math> yields

:<math>\rho(\mathbf{Y}|\mathbf{X},\mathbf{B},\boldsymbol\Sigma_{\epsilon}) \propto |\boldsymbol\Sigma_{\epsilon}|^{-n/2} \exp(-\frac{1}{2} {\rm tr}((\mathbf{Y}-\mathbf{X}\mathbf{B}) \boldsymbol\Sigma_{\epsilon}^{-1}(\mathbf{Y}-\mathbf{X}\mathbf{B})^{\rm T}) ) .</math>
 
We seek a natural conjugate prior: a joint density <math>\rho(\mathbf{B},\boldsymbol\Sigma_{\epsilon})</math> of the same functional form as the likelihood. Since the likelihood is quadratic in <math>\mathbf{B}</math>, we rewrite it so that it is normal in <math>(\mathbf{B}-\hat{\mathbf{B}})</math>, the deviation from the classical sample estimate.
 
Using the same technique as with [[Bayesian linear regression]], we decompose the exponential term using a matrix form of the sum-of-squares technique. Here, however, we will also need matrix differential calculus (the [[Kronecker product]] and the [[vectorization (mathematics)|vectorization]] transformation).
 
First, let us apply the sum-of-squares decomposition to obtain a new expression for the likelihood:
 
:<math>\rho(\mathbf{Y}|\mathbf{X},\mathbf{B},\boldsymbol\Sigma_{\epsilon}) \propto |\boldsymbol\Sigma_{\epsilon}|^{-(n-k)/2} \exp(-\frac{1}{2} {\rm tr}(\mathbf{S}^{\rm T}\mathbf{S}\boldsymbol\Sigma_{\epsilon}^{-1}))
\, |\boldsymbol\Sigma_{\epsilon}|^{-k/2} \exp(-\frac{1}{2} {\rm tr}((\mathbf{B}-\hat{\mathbf{B}})^{\rm T} \mathbf{X}^{\rm T}\mathbf{X}(\mathbf{B}-\hat{\mathbf{B}})\boldsymbol\Sigma_{\epsilon}^{-1}) )
,</math>

where

:<math>\mathbf{S} = \mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}</math>

is the matrix of residuals from the least-squares fit.
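
The decomposition can be checked numerically. The sketch below continues the arrays from the examples above and uses an arbitrary test matrix <code>B_test</code>; it only verifies the algebraic identity, not any statistical claim.

<syntaxhighlight lang="python">
# Numerical check of the sum-of-squares split of the likelihood exponent.
Sigma_inv = np.linalg.inv(Sigma_eps)
B_test = rng.normal(size=(k, m))     # an arbitrary coefficient matrix

S = Y - X @ B_hat                    # residuals at the least-squares fit
D = B_test - B_hat                   # deviation from the sample estimate

lhs = np.trace((Y - X @ B_test) @ Sigma_inv @ (Y - X @ B_test).T)
rhs = np.trace(S.T @ S @ Sigma_inv) + np.trace(D.T @ (X.T @ X) @ D @ Sigma_inv)
assert np.isclose(lhs, rhs)
</syntaxhighlight>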
 
We would like to develop a conditional form for the priors:
 
:<math>\rho(\mathbf{B},\boldsymbol\Sigma_{\epsilon}) = \rho(\boldsymbol\Sigma_{\epsilon})\rho(\mathbf{B}|\boldsymbol\Sigma_{\epsilon}),</math>
 
where <math>\rho(\boldsymbol\Sigma_{\epsilon})</math> is an [[inverse-Wishart distribution]]
and <math>\rho(\mathbf{B}|\boldsymbol\Sigma_{\epsilon})</math> is some form of [[normal distribution]] in the matrix <math>\mathbf{B}</math>. This is accomplished using the [[vectorization (mathematics)|vectorization]] transformation, which converts the likelihood from a function of the matrices <math>\mathbf{B}, \hat{\mathbf{B}}</math> to a function of the vectors <math>\boldsymbol\beta = {\rm vec}(\mathbf{B}), \hat{\boldsymbol\beta} = {\rm vec}(\hat{\mathbf{B}})</math>.
 
Write
 
:<math>{\rm tr}((\mathbf{B} - \hat{\mathbf{B}})^{\rm T}\mathbf{X}^{\rm T}\mathbf{X}(\mathbf{B} - \hat{\mathbf{B}})\boldsymbol\Sigma_{\epsilon}^{-1}) = {\rm vec}(\mathbf{B} - \hat{\mathbf{B}})^{\rm T}{\rm vec}(\mathbf{X}^{\rm T}\mathbf{X}(\mathbf{B} - \hat{\mathbf{B}})\boldsymbol\Sigma_{\epsilon}^{-1}),</math>

using the identity <math>{\rm tr}(\mathbf{A}^{\rm T}\mathbf{C}) = {\rm vec}(\mathbf{A})^{\rm T}{\rm vec}(\mathbf{C})</math>.
 
By the vectorization identity <math>{\rm vec}(\mathbf{A}\mathbf{C}\mathbf{D}) = (\mathbf{D}^{\rm T} \otimes \mathbf{A})\,{\rm vec}(\mathbf{C})</math> and the symmetry of <math>\boldsymbol\Sigma_{\epsilon}</math>,

:<math> {\rm vec}(\mathbf{X}^{\rm T}\mathbf{X}(\mathbf{B} - \hat{\mathbf{B}})\boldsymbol\Sigma_{\epsilon}^{-1}) =  (\boldsymbol\Sigma_{\epsilon}^{-1} \otimes \mathbf{X}^{\rm T}\mathbf{X} )\,{\rm vec}(\mathbf{B} - \hat{\mathbf{B}}), </math>
where <math>\mathbf{A} \otimes \mathbf{B}</math> denotes the [[Kronecker product]] of matrices '''A''' and '''B''', a generalization of the [[outer product]] which multiplies an <math>m \times n</math> matrix by a <math>p \times q</math> matrix to generate an <math>mp \times nq</math> matrix, consisting of every combination of products of elements from the two matrices.
 
Then
 
:<math>{\rm vec}(\mathbf{B} - \hat{\mathbf{B}})^{\rm T} (\boldsymbol\Sigma_{\epsilon}^{-1} \otimes \mathbf{X}^{\rm T}\mathbf{X} )\,{\rm vec}(\mathbf{B} - \hat{\mathbf{B}}) = (\boldsymbol\beta-\hat{\boldsymbol\beta})^{\rm T}(\boldsymbol\Sigma_{\epsilon}^{-1} \otimes \mathbf{X}^{\rm T}\mathbf{X} )(\boldsymbol\beta-\hat{\boldsymbol\beta}),</math>
 
which leads to a likelihood that is normal in <math>(\boldsymbol\beta - \hat{\boldsymbol\beta})</math>.
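
This identity, too, is easy to verify numerically: NumPy's <code>flatten(order='F')</code> implements the column-stacking vec operator and <code>numpy.kron</code> the Kronecker product. The sketch continues the arrays defined above.

<syntaxhighlight lang="python">
# Check: tr(D^T X^T X D Sigma^{-1}) = vec(D)^T (Sigma^{-1} kron X^T X) vec(D).
vec_D = D.flatten(order="F")              # column-stacking vectorization
K = np.kron(Sigma_inv, X.T @ X)           # (k m) x (k m) Kronecker product

quad_form = vec_D @ K @ vec_D
trace_form = np.trace(D.T @ (X.T @ X) @ D @ Sigma_inv)
assert np.isclose(quad_form, trace_form)
</syntaxhighlight>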
 
With the likelihood in a more tractable form, we can now find a natural (conditional) conjugate prior.
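
The conjugate family suggested by this form is an [[inverse-Wishart distribution|inverse-Wishart]] prior on <math>\boldsymbol\Sigma_{\epsilon}</math> together with a [[matrix normal distribution|matrix normal]] prior on <math>\mathbf{B}</math> given <math>\boldsymbol\Sigma_{\epsilon}</math>. The posterior update is not derived here; the following sketch only illustrates drawing from a prior of this conditional form with SciPy, using arbitrary placeholder hyperparameters (<code>nu0</code>, <code>V0</code>, <code>B0</code>, <code>Lambda0</code>).

<syntaxhighlight lang="python">
from scipy import stats

# Placeholder hyperparameters, chosen only for illustration.
nu0 = m + 2                    # inverse-Wishart degrees of freedom
V0 = np.eye(m)                 # inverse-Wishart scale matrix (m x m)
B0 = np.zeros((k, m))          # prior mean of B
Lambda0 = np.eye(k)            # prior precision over the rows of B (k x k)

# Sigma_eps ~ Inverse-Wishart(nu0, V0)
Sigma_draw = stats.invwishart(df=nu0, scale=V0).rvs()

# B | Sigma_eps ~ Matrix-Normal(B0, rowcov = Lambda0^{-1}, colcov = Sigma_eps)
B_draw = stats.matrix_normal(mean=B0,
                             rowcov=np.linalg.inv(Lambda0),
                             colcov=Sigma_draw).rvs()
</syntaxhighlight>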
 
==See also==
* [[Bayesian linear regression]]
* [[Matrix normal distribution]]
 
==References==
{{No footnotes|date=November 2010}}
* Bradley P. Carlin and Thomas A. Louis, ''Bayes and Empirical Bayes Methods for Data Analysis'', 2nd ed., Chapman & Hall/CRC, 2000.
 
* Peter E. Rossi, Greg M. Allenby, and Robert McCulloch, ''Bayesian Statistics and Marketing'', John Wiley & Sons, 2006.
 
{{DEFAULTSORT:Bayesian Multivariate Linear Regression}}
[[Category:Bayesian inference|Multivariate linear regression]]
[[Category:Regression analysis]]
