The '''structured support vector machine''' is a [[machine learning]] algorithm that generalizes the [[Support vector machine|Support Vector Machine]] (SVM) classifier. Whereas the SVM classifier supports [[binary classification]], [[multiclass classification]] and [[Regression analysis|regression]], the structured SVM allows training of a classifier for general [[structured learning|structured output labels]].


As an example, a sample instance might be a natural language sentence, and the output label an annotated [[parse tree]].  Training the classifier consists of showing it pairs of sample instances and their correct output labels.  After training, the structured SVM model allows one to predict the output label for a new sample instance; that is, given a natural language sentence, the classifier can produce the most likely parse tree.
 
==Training==
For a set of <math>\ell</math> training instances <math>(\boldsymbol{x}_n,y_n) \in \mathcal{X}\times\mathcal{Y}</math>, <math>n=1,\dots,\ell</math> from a sample space <math>\mathcal{X}</math> and label space <math>\mathcal{Y}</math>, the structured SVM minimizes the following regularized risk function.
:<math>\underset{\boldsymbol{w}}{\min} \quad \|\boldsymbol{w}\|^2 + C \sum_{n=1}^{\ell}
  \underset{y\in\mathcal{Y}}{\max} \left(\Delta(y_n,y) + \boldsymbol{w}'\Psi(\boldsymbol{x}_n,y) - \boldsymbol{w}'\Psi(\boldsymbol{x}_n,y_n)\right)</math>
The function is convex in <math>\boldsymbol{w}</math> because the maximum of a set of affine functions is convex. The function <math>\Delta: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+</math> measures a distance in label space; it is an arbitrary function (not necessarily a [[Metric (mathematics)|metric]]) satisfying <math>\Delta(y,z) \geq 0</math> and <math>\Delta(y,y)=0</math> for all <math>y,z \in \mathcal{Y}</math>.  The function <math>\Psi: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d</math> is a feature function, extracting a feature vector from a given sample and label.  The design of this function depends very much on the application.
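To make these objects concrete, the following sketch instantiates <math>\Psi</math> and <math>\Delta</math> for the simplest structured setting, multiclass classification.  The block feature map and zero-one loss below are illustrative choices for this special case, not part of the general formulation, and all names are hypothetical.

<syntaxhighlight lang="python">
import numpy as np

N_CLASSES = 3  # illustrative label space {0, 1, 2}

def joint_feature(x, y):
    """Joint feature map Psi(x, y) for the multiclass special case:
    x is copied into the block indexed by the label y, so that
    w' Psi(x, y) reduces to a per-class score w_y' x."""
    d = x.shape[0]
    psi = np.zeros(N_CLASSES * d)
    psi[y * d:(y + 1) * d] = x
    return psi

def delta(y_true, y):
    """Zero-one label loss: a valid Delta with Delta(y, y) = 0."""
    return 0.0 if y == y_true else 1.0
</syntaxhighlight>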
 
Because the regularized risk function above is non-differentiable, it is often reformulated in terms of a [[quadratic program]] by introducing one slack variable <math>\xi_n</math> for each sample, each representing the value of the maximum. The standard structured SVM primal formulation is given as follows.
:<math>\begin{array}{cl}
  \underset{\boldsymbol{w},\boldsymbol{\xi}}{\min} & \|\boldsymbol{w}\|^2 + C \sum_{n=1}^{\ell} \xi_n\\
  \textrm{s.t.} & \boldsymbol{w}' \Psi(\boldsymbol{x}_n,y_n) - \boldsymbol{w}' \Psi(\boldsymbol{x}_n,y) + \xi_n \geq \Delta(y_n,y),\qquad n=1,\dots,\ell,\quad \forall y \in \mathcal{Y}
  \end{array}</math>
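Under this formulation, the optimal slack <math>\xi_n</math> equals the value of the inner maximum for sample <math>n</math>, since the label <math>y=y_n</math> contributes exactly zero to the maximum.  The sketch below computes it by brute-force enumeration of the label space, reusing the illustrative <code>joint_feature</code> and <code>delta</code> above; real structured problems replace the enumeration with a problem-specific decoder.

<syntaxhighlight lang="python">
def slack_value(w, x, y_true, label_space):
    """Optimal slack xi_n for one sample: the maximum over y of
    Delta(y_n, y) + w' Psi(x_n, y) - w' Psi(x_n, y_n).
    Non-negative because y = y_n contributes exactly zero."""
    score_true = w @ joint_feature(x, y_true)
    return max(delta(y_true, y) + w @ joint_feature(x, y) - score_true
               for y in label_space)
</syntaxhighlight>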
 
==Inference==
 
At test time, only a sample <math>\boldsymbol{x} \in \mathcal{X}</math> is known, and a prediction function <math>f: \mathcal{X} \to \mathcal{Y}</math> maps it to a predicted label from the label space <math>\mathcal{Y}</math>.  For structured SVMs, given the vector <math>\boldsymbol{w}</math> obtained from training, the prediction function is the following.
:<math>f(\boldsymbol{x}) = \underset{y \in \mathcal{Y}}{\textrm{argmax}} \quad \boldsymbol{w}'\Psi(\boldsymbol{x},y)</math>
 
Therefore, the maximizer over the label space is the predicted label.  Solving for this maximizer is the so-called inference problem, and is similar to making a maximum a-posteriori (MAP) prediction in probabilistic models.  Depending on the structure of the function <math>\Psi</math>, solving for the maximizer can be a hard problem.
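For small, enumerable label spaces the prediction function can be evaluated directly, as in the hedged sketch below, which reuses the illustrative <code>joint_feature</code> above; for combinatorial label spaces such as label sequences or parse trees, the argmax is instead computed by a problem-specific inference routine.

<syntaxhighlight lang="python">
def predict(w, x, label_space):
    """Prediction f(x) = argmax_y w' Psi(x, y), by enumeration.
    Combinatorial label spaces require a dedicated decoder instead
    (e.g. dynamic programming such as Viterbi for chains)."""
    return max(label_space, key=lambda y: w @ joint_feature(x, y))
</syntaxhighlight>

For instance, with the multiclass <code>joint_feature</code> above, <code>predict(w, x, range(N_CLASSES))</code> returns the highest-scoring class.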
 
== Separation ==
 
The above quadratic program involves a very large, possibly infinite number of linear inequality constraints.  In general, the number of inequalities is too large to be optimized over explicitly.  Instead, the problem is solved by [[delayed constraint generation]], in which only a finite and small subset of the constraints is used.  Optimizing over a subset of the constraints enlarges the [[feasible set]] and will yield a solution which provides a lower bound on the objective.  To test whether the solution <math>\boldsymbol{w}</math> violates any constraint of the complete set of inequalities, a [[separation]] problem needs to be solved.  As the inequalities decompose over the samples, for each sample <math>(\boldsymbol{x}_n,y_n)</math> the following problem needs to be solved.
 
:<math>y_n^* = \underset{y \in \mathcal{Y}}{\textrm{argmax}} \left(
  \Delta(y_n,y) + \boldsymbol{w}'\Psi(\boldsymbol{x}_n,y) - \boldsymbol{w}'\Psi(\boldsymbol{x}_n,y_n) - \xi_n\right)</math>
 
The right-hand side objective to be maximized is composed of the constant <math>-\boldsymbol{w}'\Psi(\boldsymbol{x}_n,y_n) - \xi_n</math> and a term dependent on the variable being optimized over, namely <math>\Delta(y_n,y) + \boldsymbol{w}'\Psi(\boldsymbol{x}_n,y)</math>.  If the achieved objective value is less than or equal to zero, no violated constraints exist for this sample.  If it is strictly larger than zero, the most violated constraint with respect to this sample has been identified.  The quadratic program is enlarged by this constraint and re-solved.  The process continues until no violated inequalities can be identified.
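The sketch below mirrors this separation step for a single sample, again by enumeration: it finds the maximizer <math>y_n^*</math> (which is unaffected by the constant terms) and reports it only if the full objective, including the constants, exceeds a small tolerance.  The function name and tolerance are illustrative.

<syntaxhighlight lang="python">
def most_violated_constraint(w, xi_n, x, y_true, label_space, tol=1e-8):
    """Separation step for one sample: maximize
    Delta(y_n, y) + w' Psi(x_n, y) over y, then evaluate the full
    objective including the constants -w' Psi(x_n, y_n) - xi_n.
    Returns y* if its constraint is violated, otherwise None."""
    y_star = max(label_space,
                 key=lambda y: delta(y_true, y) + w @ joint_feature(x, y))
    violation = (delta(y_true, y_star) + w @ joint_feature(x, y_star)
                 - w @ joint_feature(x, y_true) - xi_n)
    return y_star if violation > tol else None
</syntaxhighlight>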
 
If the constant terms are dropped from the above problem, we obtain the following problem to be solved.
:<math>y_n^* = \underset{y \in \mathcal{Y}}{\textrm{argmax}} \left(\Delta(y_n,y) + \boldsymbol{w}'\Psi(\boldsymbol{x}_n,y)\right)</math>
 
This problem looks very similar to the inference problem.  The only difference is the added term <math>\Delta(y_n,y)</math>.  Most often, <math>\Delta</math> is chosen such that it decomposes naturally over the label space.  In that case, the influence of <math>\Delta</math> can be encoded into the inference problem, and solving for the most violating constraint is equivalent to solving the inference problem.
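A full delayed-constraint-generation solver maintains a working set of constraints and re-solves the restricted quadratic program after each separation round.  As a deliberately simplified stand-in, the sketch below instead minimizes the equivalent unconstrained regularized risk by subgradient descent; it calls the same loss-augmented argmax as its oracle.  All names and hyperparameters are illustrative, and this is not the cutting-plane method itself.

<syntaxhighlight lang="python">
def train_subgradient(samples, label_space, dim, C=1.0, lr=1e-3, epochs=200):
    """Subgradient descent on ||w||^2 + C * sum_n max_y (Delta + score gap).
    A simplification of, not a substitute for, the delayed constraint
    generation described above; both rely on loss-augmented inference."""
    w = np.zeros(dim)
    for _ in range(epochs):
        grad = 2.0 * w  # gradient of the squared-norm regularizer
        for x, y_true in samples:
            # Loss-augmented inference: the most violating label
            y_star = max(label_space,
                         key=lambda y: delta(y_true, y) + w @ joint_feature(x, y))
            if y_star != y_true:
                # Subgradient of the max term: Psi(x_n, y*) - Psi(x_n, y_n)
                grad += C * (joint_feature(x, y_star) - joint_feature(x, y_true))
        w -= lr * grad
    return w
</syntaxhighlight>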
 
== References ==
 
* Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann and Yasemin Altun (2005), [http://www.jmlr.org/papers/volume6/tsochantaridis05a/tsochantaridis05a.pdf Large Margin Methods for Structured and Interdependent Output Variables], Journal of Machine Learning Research, Vol. 6, pages 1453–1484.
* Thomas Finley and Thorsten Joachims (2008), [http://icml2008.cs.helsinki.fi/papers/279.pdf Training Structural SVMs when Exact Inference is Intractable], ICML 2008.
* Sunita Sarawagi and Rahul Gupta (2008), [http://icml2008.cs.helsinki.fi/papers/402.pdf Accurate Max-margin Training for Structured Output Spaces], ICML 2008.
* Gökhan BakIr, Ben Taskar, Thomas Hofmann, Bernhard Schölkopf, Alex Smola and S. V. N. Vishwanathan (2007), [http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=11332 Predicting Structured Data], MIT Press.
* Vojtěch Franc and Bogdan Savchynskyy (2008), [http://www.jmlr.org/papers/volume9/franc08a/franc08a.pdf Discriminative Learning of Max-Sum Classifiers], Journal of Machine Learning Research, Vol. 9, pages 67–104.
* Kevin Murphy (2012), [http://www.cs.ubc.ca/~murphyk/MLbook/pml-print3-ch19.pdf Machine Learning: A Probabilistic Perspective], MIT Press.
 
[[Category:Structured prediction]]
[[Category:Support vector machines]]
