Exponential mechanism (differential privacy)

The '''exponential mechanism''' is a technique for designing differentially private algorithms developed by [http://research.microsoft.com/en-us/people/mcsherry/ Frank McSherry] and [http://research.microsoft.com/en-us/people/kunal/default.aspx Kunal Talwar]. [[Differential privacy]] is a technique for releasing statistical information about a database without revealing information about its individual entries. <br>
Most of the initial research in the field of [[differential privacy]] revolved around real-valued functions which have relatively low [[Differential privacy|sensitivity]] to changes in the data of a single individual and whose usefulness is not hampered by small additive perturbations. A natural question is what happens when one wants to preserve more general sets of properties. The exponential mechanism helps to extend the notion of differential privacy to address these issues. Moreover, it describes a class of mechanisms that includes all possible differentially private mechanisms.


== The exponential mechanism <ref>[http://research.microsoft.com/pubs/65075/mdviadp.pdf F. McSherry and K. Talwar. Mechanism Design via Differential Privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, 2007.]</ref> ==


=== Algorithm ===
In very generic terms, a privacy mechanism maps a set of <math>n\,\!</math> inputs from a domain <math>\mathcal{D}\,\!</math> to a range <math>\mathcal{R}\,\!</math>. The map may be randomized, in which case each element of the domain <math>\mathcal{D}^n\,\!</math> corresponds to a probability distribution over the range <math>\mathcal{R}\,\!</math>. The privacy mechanism we are going to design makes no assumptions about the nature of <math>\mathcal{D}\,\!</math> and <math>\mathcal{R}\,\!</math> apart from a base [[Measure (mathematics)|measure]] <math>\mu\,\!</math> on <math>\mathcal{R}\,\!</math>. Let us define a function <math>q:\mathcal{D}^n\times\mathcal{R}\rightarrow\mathbb{R}\,\!</math>. Intuitively, this function assigns a score to each pair <math>(d,r)\,\!</math>, where <math>d\in\mathcal{D}^n\,\!</math> and <math>r\in\mathcal{R}\,\!</math>; the higher the score, the more appealing the pair is. 
Once we are given the input <math>d\in\mathcal{D}^n\,\!</math>, the mechanism's objective is to return an <math>r\in\mathcal{R}\,\!</math> such that the function <math>q(d,r)\,\!</math> is approximately maximized. To achieve this, we set up the mechanism <math>\mathcal{E}_{q}^{\epsilon}(d)\,\!</math> as follows: <br>
'''Definition:''' For any function <math>q:(\mathcal{D}^{n}\times\mathcal{R})\rightarrow\mathbb{R}\,\!</math>, and a base measure <math>\mu\,\!</math> over <math>\mathcal{R}\,\!</math>, we define:
:<math>\mathcal{E}_{q}^{\epsilon}(d):=\,\!</math> Choose <math>r\,\!</math> with probability proportional to <math>e^{\epsilon q(d,r)}\times\mu(r)\,\!</math>, where <math>d\in\mathcal{D}^n,r\in \mathcal{R}\,\!</math>.
This definition implies that the probability of returning a particular <math>r\,\!</math> increases exponentially with the value of <math>q(d,r)\,\!</math>. Ignoring the base measure <math>\mu\,\!</math> for the moment, the value <math>r\,\!</math> which maximizes <math>q(d,r)\,\!</math> has the highest probability. Moreover, we claim that this mechanism is differentially private; we will prove this claim shortly. One technicality to keep in mind is that, for <math>\mathcal{E}_{q}^{\epsilon}(d)\,\!</math> to be well defined, the integral <math>\int_{r}e^{\epsilon q(d,r)}\mu(r)\,dr\,\!</math> should be finite.
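
To make the definition concrete, here is a minimal sketch of the mechanism for the special case of a finite range <math>\mathcal{R}\,\!</math> with the counting measure as <math>\mu\,\!</math>; the function and parameter names (<code>score_fn</code>, <code>candidates</code>, <code>epsilon</code>) are illustrative rather than taken from the original paper.

<syntaxhighlight lang="python">
import numpy as np

def exponential_mechanism(data, candidates, score_fn, epsilon, rng=None):
    """Sample one candidate r with probability proportional to exp(epsilon * q(data, r)).

    Assumes a finite candidate set, with the counting measure playing the role of mu.
    """
    rng = rng if rng is not None else np.random.default_rng()
    scores = np.array([score_fn(data, r) for r in candidates], dtype=float)
    # Subtract the maximum score before exponentiating for numerical stability;
    # this rescales every weight by the same constant and leaves the
    # sampling distribution unchanged.
    weights = np.exp(epsilon * (scores - scores.max()))
    probabilities = weights / weights.sum()
    index = rng.choice(len(candidates), p=probabilities)
    return candidates[index]
</syntaxhighlight>

With the counting measure the normalizing integral becomes a finite sum, so the finiteness condition above holds automatically. By the theorem below, running this sketch with parameter <math>\epsilon\,\!</math> gives <math>2\epsilon\Delta q\,\!</math>-differential privacy, so to target a privacy budget of <math>\epsilon\,\!</math> one would pass <math>\epsilon/(2\Delta q)\,\!</math> instead.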
 
'''Theorem (Differential Privacy):''' <math>\mathcal{E}_{q}^{\epsilon}(d)\,\!</math> gives <math>(2\epsilon\Delta q)\,\!</math>-differential privacy.
 
Proof: The probability density of <math>\mathcal{E}_{q}^{\epsilon}(d)\,\!</math> at <math>r\,\!</math> equals
:<math>\frac{e^{\epsilon q(d,r)}\mu(r)}{\int e^{\epsilon q(d,r)}\mu(r)dr}\,\!</math>.
Now, if a single change in <math>d\,\!</math> changes <math>q\,\!</math> by at most <math>\Delta q\,\!</math>, then the numerator can change by at most a factor of <math>e^{\epsilon\Delta q}\,\!</math> and the denominator is at least <math>e^{-\epsilon\Delta q}\,\!</math> times its previous value. Thus, the ratio of the new probability density (i.e. with the new <math>d\,\!</math>) to the earlier one is at most <math>\exp(2\epsilon\Delta q)\,\!</math>.
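
As a numerical sanity check of this bound (a sketch, not part of the original argument), one can compute the exact output distributions of the mechanism on two neighbouring toy datasets, using a score function with <math>\Delta q=1\,\!</math>, and verify that their pointwise ratio never exceeds <math>\exp(2\epsilon\Delta q)\,\!</math>; the datasets and score below are made up for illustration.

<syntaxhighlight lang="python">
import numpy as np

def output_distribution(data, candidates, score_fn, epsilon):
    """Exact output distribution of the exponential mechanism over a finite candidate set."""
    scores = np.array([score_fn(data, r) for r in candidates], dtype=float)
    weights = np.exp(epsilon * (scores - scores.max()))
    return weights / weights.sum()

# Toy score: q(d, r) = -|r - sum(d)|, so changing one binary entry of d gives Delta q = 1.
score_fn = lambda d, r: -abs(r - sum(d))
candidates = list(range(21))
epsilon = 0.5

d1 = [1, 0, 1, 1, 0, 1, 1, 0, 0, 1]   # original dataset
d2 = [1, 0, 1, 1, 0, 1, 1, 0, 0, 0]   # neighbouring dataset: one entry changed

p1 = output_distribution(d1, candidates, score_fn, epsilon)
p2 = output_distribution(d2, candidates, score_fn, epsilon)

max_ratio = max(np.max(p1 / p2), np.max(p2 / p1))
print(max_ratio <= np.exp(2 * epsilon * 1))   # True: the ratio respects exp(2*eps*Delta q)
</syntaxhighlight>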
 
=== Accuracy ===
 
We would ideally want the random draws of <math>r\,\!</math> from the mechanism <math>\mathcal{E}_{q}^{\epsilon}(d)\,\!</math> to nearly maximize <math>q(d,r)\,\!</math>. If we let <math>OPT\,\!</math> denote <math>\max_rq(d,r)\,\!</math>, then we can show that the probability of the mechanism deviating far from <math>OPT\,\!</math> is low, as long as there is a sufficient mass (in terms of <math>\mu\,\!</math>) of values <math>r\,\!</math> with score <math>q\,\!</math> close to the optimum.
 
'''Lemma:''' Let <math>S_{t}=\{r:q(d,r)>OPT-t\}\,\!</math> and <math>\bar{S}_{2t}=\{r:q(d,r)\leq OPT-2t\}\,\!</math>. Then <math>p(\bar{S}_{2t})\,\!</math> is at most <math>\exp(-\epsilon t)/\mu(S_{t})\,\!</math>, where the probability is taken over the output of the mechanism in <math>\mathcal{R}\,\!</math>.
 
Proof: The probability <math>p(\bar{S}_{2t})\,\!</math> is at most <math>p(\bar{S}_{2t})/p(S_t)\,\!</math>, as the denominator can be at most one. Since both probabilities share the same normalizing term,
 
:<math>\frac{p(\bar{S}_{2t})}{p(S_{t})} = \frac{\int_{\bar{S}_{2t}}\exp(\epsilon q(d,r))\mu(r) \, dr}{\int_{S_{t}} \exp(\epsilon q(d,r))\mu(r) \, dr} \leq \exp(-\epsilon t) \frac{\mu(\bar{S}_{2t})}{\mu(S_t)}. </math>
 
The value of <math>\mu(\bar{S}_{2t})\,\!</math> is at most one, and so this bound implies the lemma statement.
 
'''Theorem (Accuracy):''' For those values of <math>t\geq \ln(\frac{OPT}{t\mu(S_{t})})/\epsilon\,\!</math>, we have <math> E[q(d,\mathcal{E}_{q}^{\epsilon}(d))]\geq OPT-3t\,\!</math>.
 
Proof: It follows from the previous lemma that the probability of the score being at least <math>OPT-2t\,\!</math> is at least <math>1-\exp(-\epsilon t)/\mu(S_{t})\,\!</math>. By hypothesis, <math>t\geq \ln(\frac{OPT}{t\mu(S_{t})})/\epsilon\,\!</math>, so <math>\exp(-\epsilon t)/\mu(S_{t})\leq t/OPT\,\!</math> and this probability is at least <math>1-t/OPT\,\!</math>. Multiplying by <math>OPT-2t\,\!</math> yields the desired bound.
 
We may assume <math>\mu(A)\leq 1\,\!</math> for every <math>A\subseteq \mathcal{R}\,\!</math> in all of these computations, because we can always normalize by <math>\mu(\mathcal{R})\,\!</math>.
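
Continuing the toy example from the privacy check above (again a sketch with made-up numbers), one can compute the exact tail probability <math>p(\bar{S}_{2t})\,\!</math> and compare it with the bound <math>\exp(-\epsilon t)/\mu(S_{t})\,\!</math>, using the normalized counting measure as <math>\mu\,\!</math>.

<syntaxhighlight lang="python">
import numpy as np

# Same toy setup as in the privacy check: q(d, r) = -|r - sum(d)| over candidates 0..20.
score_fn = lambda d, r: -abs(r - sum(d))
candidates = np.arange(21)
epsilon, t = 0.5, 4.0
d = [1, 0, 1, 1, 0, 1, 1, 0, 0, 1]

scores = np.array([score_fn(d, r) for r in candidates], dtype=float)
weights = np.exp(epsilon * (scores - scores.max()))
p = weights / weights.sum()
opt = scores.max()

# Normalized counting measure, so that mu(R) = 1 as assumed in the text.
mu_S_t = np.mean(scores > opt - t)

lemma_bound = np.exp(-epsilon * t) / mu_S_t
tail_prob = p[scores <= opt - 2 * t].sum()   # p(S-bar_{2t}) = Pr[q(d, r) <= OPT - 2t]
print(tail_prob, lemma_bound, tail_prob <= lemma_bound)   # the last value is True
</syntaxhighlight>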
 
== Example application of the exponential mechanism <ref>[http://www.cs.cmu.edu/~alroth/Papers/dataprivacy.pdf Avrim Blum, Katrina Ligett, Aaron Roth. A Learning Theory Approach to Non-Interactive Database Privacy. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, 2008.]</ref>==
 
Before we get into the details of the example, let us define some terms that we will use throughout the discussion.
 
'''Definition (global sensitivity):''' The global sensitivity of a query <math>Q\,\!</math> is its maximum difference when evaluated on two neighbouring datasets <math>D_1,D_2\in\mathcal{D}^n\,\!</math>:
:<math>GS_{Q}=\max_{D_1,D_2:d(D_1,D_2)=1}|Q(D_1)-Q(D_2)|\,\!</math>.
 
'''Definition:''' A predicate query <math>Q_{\varphi}\,\!</math> for any predicate <math>\varphi\,\!</math> is defined to be
:<math>Q_{\varphi}=\frac{|\{x\in D:\varphi(x)\}|}{|D|}\,\!</math>.
 
Note that <math>GS_{Q_{\varphi}}\leq 1/n\,\!</math> for any predicate <math>\varphi\,\!</math>.
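
A small sketch of these two definitions (the predicate and the datasets below are made up for illustration).

<syntaxhighlight lang="python">
def predicate_query(dataset, predicate):
    """Q_phi(D): the fraction of records in the dataset satisfying the predicate."""
    return sum(1 for x in dataset if predicate(x)) / len(dataset)

# Records are k-tuples of bits; the example predicate checks the first bit.
phi = lambda x: x[0] == 1
D1 = [(1, 0, 1), (0, 1, 1), (1, 1, 0), (0, 0, 0)]
D2 = [(1, 0, 1), (0, 1, 1), (1, 1, 0), (1, 0, 0)]   # neighbouring: last record changed

# Changing one of the n records moves the count by at most 1, so the query
# value moves by at most 1/n, matching GS_{Q_phi} <= 1/n.
print(abs(predicate_query(D1, phi) - predicate_query(D2, phi)))   # 0.25 = 1/n for n = 4
</syntaxhighlight>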
 
=== Release mechanism ===
 
The following is due to [http://www.cs.cmu.edu/~avrim/ Avrim Blum], [http://www.cs.cmu.edu/~katrina/ Katrina Ligett] and [http://www.cs.cmu.edu/~alroth/ Aaron Roth].
 
'''Definition (Usefulness):''' A [http://cryptowiki.cse.psu.edu/mediawiki/index.php/CSE546-Spring-2009/Differential-Privacy mechanism] <math>\mathcal{A}\,\!</math> is <math>(\alpha,\delta)\,\!</math>-useful for queries in a class <math>H\,\!</math> if, for every dataset <math>D\,\!</math>, with probability <math>1-\delta\,\!</math> over <math>\widehat{D}=\mathcal{A}(D)\,\!</math>, we have <math>|Q_h(\widehat{D})-Q_h(D)|\leq \alpha\,\!</math> for all <math>h\in H\,\!</math>.
 
Informally, it means that with high probability the query <math>Q_{h}\,\!</math> will behave in a similar way on the original dataset <math>D\,\!</math> and on the synthetic dataset <math>\widehat{D}\,\!</math>. <br>
Let us consider a common problem in data mining. Assume there is a database <math>D\,\!</math> with <math>n\,\!</math> entries. Each entry consists of a <math>k\,\!</math>-tuple of the form <math>(x_1,x_2,\dots,x_k)\,\!</math> where <math>x_{i}\in\{0,1\}\,\!</math>. Now, a user wants to learn a [[Half-space (geometry)|linear halfspace]] of the form <math>\pi_1 x_1 + \pi_2 x_2+\cdots+\pi_{k-1}x_{k-1}\geq x_{k}\,\!</math>. In essence, the user wants to find values of <math>\pi_1,\pi_2,\dots,\pi_{k-1}\,\!</math> such that the maximum number of tuples in the database satisfy the inequality. The algorithm described below can generate a synthetic database <math>\widehat{D}\,\!</math> which allows the user to learn (approximately) the same linear halfspace while querying only this synthetic database. The motivation for such an algorithm is that the new database is generated in a differentially private manner, and thus assures privacy to the individual records in the database <math>D\,\!</math>.
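
To see how this fits the framework above, note that each choice of coefficients defines a predicate on tuples, and the fraction of tuples satisfying the halfspace is exactly a predicate query; the following toy sketch (coefficients and database made up) illustrates this.

<syntaxhighlight lang="python">
def halfspace_predicate(pi):
    """phi_pi(x): does the tuple x satisfy pi_1*x_1 + ... + pi_{k-1}*x_{k-1} >= x_k?"""
    return lambda x: sum(p * xi for p, xi in zip(pi, x[:-1])) >= x[-1]

def fraction_satisfied(D, pi):
    """Q_{phi_pi}(D): the fraction of tuples in D satisfying the halfspace."""
    phi = halfspace_predicate(pi)
    return sum(1 for x in D if phi(x)) / len(D)

# k = 3: the user searches for coefficients (pi_1, pi_2) maximising this fraction.
D = [(1, 0, 1), (1, 1, 0), (0, 1, 1), (1, 1, 1), (0, 0, 0)]
print(fraction_satisfied(D, (0.5, 0.5)))   # 0.6 for this toy database
</syntaxhighlight>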
 
In this section we show that it is possible to release a dataset which is useful for concepts from a polynomial [[VC dimension|VC-Dimension]] class and at the same time adhere to <math>\epsilon\,\!</math>-differential privacy, as long as the size of the original dataset is at least polynomial in the [[VC dimension|VC-Dimension]] of the concept class. To state this formally:
 
'''Theorem:''' For any class of functions <math>H\,\!</math> and any dataset <math>D\subset \{0,1\}^{k}\,\!</math> such that
:<math>|D|\geq O\left(\frac{k\cdot VCDIM(H)\log(1/\alpha)}{\alpha^{3}\epsilon}+\frac{\log(1/\delta)}{\alpha\epsilon}\right)\,\!</math>
we can output an <math>(\alpha,\delta)\,\!</math>-useful dataset <math>\widehat{D}\,\!</math> that preserves <math>\epsilon\,\!</math>-differential privacy. Note that the algorithm need not be efficient.
 
One interesting fact is that the algorithm which we are going to develop generates a synthetic dataset whose size is independent of the size of the original dataset; in fact, it depends only on the [[VC dimension|VC-dimension]] of the concept class and the parameter <math>\alpha\,\!</math>. The algorithm outputs a dataset of size <math>\tilde{O}(VCDIM(H)/\alpha^{2})\,\!</math>.
 
We borrow the [[Uniform convergence (combinatorics)|Uniform Convergence Theorem]] from [[combinatorics]] and state a corollary of it that suits our needs.
 
'''Lemma:''' Given any dataset <math>D\,\!</math>, there exists a dataset <math>\widehat{D}\,\!</math> of size <math>O(VCDIM(H)\log(1/\alpha)/\alpha^{2})\,\!</math> such that <math>\max_{h\in H}|Q_{h}(D)-Q_{h}(\widehat{D})|\leq \alpha/2\,\!</math>.
 
Proof:
 
We know from the uniform convergence theorem that,
 
:<math>\Pr\left[\,|Q_{h}(D)-Q_{h}(\widehat{D})|\geq \alpha/2 \text{ for some } h\in H\,\right]\leq 2\left(\frac{em}{VCDIM(H)}\right)^{VCDIM(H)}\cdot e^{-\frac{\alpha^{2}m}{8}}\,\!</math>,
where the probability is over the random choice of the dataset <math>\widehat{D}\,\!</math>.
Thus, if the RHS is less than one, then we know for sure that the dataset <math>\widehat{D}\,\!</math> exists. To bound the RHS by a value less than one we need <math>m\geq\lambda(VCDIM(H)\log(m/VCDIM(H))/\alpha^{2})\,\!</math>, where <math>\lambda\,\!</math> is some positive constant. Since we stated earlier that we will output a dataset of size <math>\tilde{O}(VCDIM(H)/\alpha^{2})\,\!</math>, using this bound on <math>m\,\!</math> we get <math>m\geq\lambda(VCDIM(H)\log(1/\alpha)/\alpha^{2})\,\!</math>. Hence the lemma.
 
Now we invoke the Exponential Mechanism.
 
'''Definition:''' For any function <math>q:((\{0,1\}^k)^n \times(\{0,1\}^k)^m)\rightarrow\mathbb{R}\,\!</math> and input dataset <math>D\,\!</math>, the Exponential mechanism outputs each dataset <math>\widehat{D}\,\!</math> with probability proportional to <math>e^{q(D,\widehat{D})\epsilon n/2}\,\!</math>.
 
By the differential privacy theorem above, this mechanism preserves <math>(\epsilon n GS_{q})\,\!</math>-differential privacy.
Let us now return to the proof of the theorem.
 
We define <math>q(D,\widehat{D})=-\max_{h\in H}|Q_{h}(D)-Q_{h}(\widehat{D})|\,\!</math>. <br>
To show that the mechanism satisfies the <math>(\alpha,\delta)\,\!</math>-usefulness, we should show that it outputs some dataset <math>\widehat{D}\,\!</math> with <math>q(D,\widehat{D})\geq -\alpha\,\!</math> with probability <math>1-\delta\,\!</math>.  
There are at most <math>2^{km}\,\!</math> output datasets, and the probability that <math>q(D,\widehat{D})\leq -\alpha\,\!</math> is at most proportional to <math>e^{-\epsilon\alpha n/2}\,\!</math>. Thus, by a union bound, the probability of outputting any such dataset <math>\widehat{D}\,\!</math> is at most proportional to <math>2^{km}e^{-\epsilon\alpha n/2}\,\!</math>.
Again, we know that there exists some dataset <math>\widehat{D}\in(\{0,1\}^{k})^{m}\,\!</math> for which <math>q(D,\widehat{D})\geq -\alpha/2\,\!</math>. Therefore, such a dataset is output with probability at least proportional to <math>e^{-\alpha\epsilon n/4}\,\!</math>. <br>
Let <math>A\,\!</math> denote the event that the Exponential mechanism outputs some dataset <math>\widehat{D}\,\!</math> such that <math>q(D,\widehat{D})\geq-\alpha/2\,\!</math>, <br>
and <math>B\,\!</math> the event that it outputs some dataset <math>\widehat{D}\,\!</math> such that <math>q(D,\widehat{D})\leq-\alpha\,\!</math>.
:<math>\therefore \frac{\Pr[A]}{\Pr[B]}\geq \frac{e^{-\alpha\epsilon n/4}}{2^{km}e^{-\alpha\epsilon n/2}}=\frac{e^{\alpha\epsilon n/4}}{2^{km}}.\,\!</math>
Now, requiring this ratio to be at least <math>1/\delta\geq(1-\delta)/\delta\,\!</math> guarantees <math>\Pr[B]\leq\delta\,\!</math>, and we find that it suffices to have
:<math>n\geq\frac{4}{\epsilon\alpha}\left(km+\ln\frac{1}{\delta}\right)\geq O\left(\frac{k\cdot VCDIM(H)\log(1/\alpha)}{\alpha^{3}\epsilon}+\frac{\log(1/\delta)}{\alpha\epsilon}\right)\,\!</math>.
This proves the theorem.
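
For very small domains the release mechanism can be carried out literally, by enumerating every candidate synthetic dataset of size <math>m\,\!</math> and sampling with the exponential mechanism and the score <math>q(D,\widehat{D})\,\!</math> defined above. The sketch below does exactly that; the helper names and the toy concept class are illustrative, and the enumeration is exponential in <math>km\,\!</math>, in line with the remark that the algorithm need not be efficient.

<syntaxhighlight lang="python">
import itertools
import math
import random

def predicate_query(dataset, predicate):
    """Q_h(D): the fraction of records in the dataset satisfying the predicate."""
    return sum(1 for x in dataset if predicate(x)) / len(dataset)

def release_synthetic_dataset(D, H, k, m, epsilon, rng=None):
    """Brute-force exponential mechanism over all size-m candidate datasets.

    The score is q(D, D_hat) = -max_{h in H} |Q_h(D) - Q_h(D_hat)|, and each
    candidate is sampled with probability proportional to exp(q * epsilon * n / 2).
    Enumeration is exponential in k and m, so this is only a conceptual sketch.
    """
    rng = rng or random.Random(0)
    n = len(D)
    records = list(itertools.product([0, 1], repeat=k))
    candidates = list(itertools.combinations_with_replacement(records, m))
    scores = [-max(abs(predicate_query(D, h) - predicate_query(c, h)) for h in H)
              for c in candidates]
    best = max(scores)   # shift scores for numerical stability; the distribution is unchanged
    weights = [math.exp((s - best) * epsilon * n / 2) for s in scores]
    return list(rng.choices(candidates, weights=weights, k=1)[0])

# Tiny illustration: records are 2-bit tuples, and H contains one predicate per bit.
H = [lambda x: x[0] == 1, lambda x: x[1] == 1]
D = [(1, 1), (1, 0), (1, 1), (0, 0), (1, 1), (1, 0)]
print(release_synthetic_dataset(D, H, k=2, m=3, epsilon=1.0))
</syntaxhighlight>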
 
== The Exponential Mechanism in other domains ==
 
We have just seen an example of the use of the exponential mechanism in which one can output a synthetic dataset in a differentially private manner and use that dataset to answer queries with good accuracy. Apart from this kind of setting, the exponential mechanism has also been studied in the context of [[auction theory]] and [[Statistical classification|classification algorithms]].<ref>[http://arxiv.org/abs/0803.0924v2 Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, Adam Smith. What Can We Learn Privately? In Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science, 2008.]</ref> In the case of auctions, the exponential mechanism helps to achieve a ''truthful'' auction setting.
 
==References==
{{Reflist}}
 
{{DEFAULTSORT:Exponential Mechanism (Differential Privacy)}}
[[Category:Data privacy]]
[[Category:Theory of cryptography]]
[[Category:Applied probability]]
