'''Discounted cumulative gain''' ('''DCG''') is a measure of effectiveness of a [[World Wide Web|Web]] [[search engine]] [[algorithm]] or related applications, often used in [[information retrieval]]. Using a [[Relevance (information retrieval)|graded relevance]] scale of documents in a search engine result set, DCG measures the usefulness, or ''gain'', of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom with the gain of each result discounted at lower ranks.<ref>Kalervo Jarvelin, Jaana Kekalainen: Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems 20(4), 422–446 (2002)</ref>
== Overview ==
Two assumptions are made in using DCG and its related measures.
# Highly relevant documents are more useful when appearing earlier in a search engine result list (have higher ranks).
# Highly relevant documents are more useful than marginally relevant documents, which are in turn more useful than irrelevant documents.
DCG originates from an earlier, more primitive, measure called Cumulative Gain.
=== Cumulative Gain ===
Cumulative Gain (CG) is the predecessor of DCG and does not include the position of a result in the consideration of the usefulness of a result set. In this way, it is the sum of the graded relevance values of all results in a search result list. The CG at a particular rank position <math>p</math> is defined as:
:<math> \mathrm{CG_{p}} = \sum_{i=1}^{p} rel_{i} </math>
Where <math>rel_{i}</math> is the graded relevance of the result at position <math>i</math>.
The value computed with the CG function is unaffected by changes in the ordering of search results. That is, moving a highly relevant document <math>d_{i}</math> above a higher ranked, less relevant, document <math>d_{j}</math> does not change the computed value for CG. Based on the two assumptions made above about the usefulness of search results, DCG is used in place of CG for a more accurate measure.
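The order-invariance of CG can be illustrated with a minimal sketch. The function name <code>cumulative_gain</code> and the relevance lists are hypothetical and chosen only for illustration:

```python
from typing import Sequence


def cumulative_gain(relevances: Sequence[float], p: int) -> float:
    """Return CG_p: the sum of graded relevance over the top p results."""
    return float(sum(relevances[:p]))


# Hypothetical graded judgments for one result list (illustrative only).
original = [1, 3, 0, 2]
# Move the highly relevant document (grade 3) from rank 2 to rank 1.
reordered = [3, 1, 0, 2]

# Both orderings give the same CG at p = 4, because CG ignores position.
print(cumulative_gain(original, 4))
print(cumulative_gain(reordered, 4))
```

Both calls sum the same four grades, so the better ordering receives no credit, which is the shortcoming DCG addresses.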
=== Discounted Cumulative Gain ===
The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalized: the graded relevance value of each result is divided by the logarithm of its position. The discounted CG accumulated at a particular rank position <math>p</math> is defined as:
:<math> \mathrm{DCG_{p}} = rel_{1} + \sum_{i=2}^{p} \frac{rel_{i}}{\log_{2}(i)} </math>
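As a rough sketch, the definition above might be computed as follows. The function name <code>dcg</code> and the two sample orderings are assumptions made for illustration, not part of any particular library:

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float], p: int) -> float:
    """Return DCG_p = rel_1 + sum over i = 2..p of rel_i / log2(i)."""
    top = list(relevances[:p])
    if not top:
        return 0.0
    # Rank 1 is not discounted; ranks 2..p are divided by log2(rank).
    discounted = sum(rel / math.log2(i) for i, rel in enumerate(top[1:], start=2))
    return top[0] + discounted


# Hypothetical judgments: the same three documents in two different orders.
good_first = [3, 1, 0]
good_last = [0, 1, 3]

print(dcg(good_first, 3))
print(dcg(good_last, 3))
```

Unlike CG, the two orderings now receive different scores, with the ordering that places the most relevant document first scoring higher.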
No theoretically sound justification for using a [[logarithm|logarithmic]] reduction factor had previously been shown<ref>{{cite book | title=Search Engines: Information Retrieval in Practice | author=B. Croft, D. Metzler, and T. Strohman |date=2009 | publisher=Addison Wesley}}</ref> other than the fact that it produces a smooth reduction. An alternative formulation of DCG<ref>Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning (ICML '05). ACM, New York, NY, USA, 89-96. DOI=10.1145/1102351.1102363</ref> places stronger emphasis on retrieving relevant documents:
:<math> \mathrm{DCG_{p}} = \sum_{i=1}^{p} \frac{ 2^{rel_{i}} - 1 }{ \log_{2}(i+1)} </math>
In Croft, Metzler and Strohman (page 320, 2010), the authors mistakenly claim that these two formulations of DCG are the same when the relevance values of documents are [[binary function|binary]], <math>rel_{i} \in \{0,1\}</math>. To see that they are not the same, suppose there is exactly one relevant document and it appears at rank 2. The first formulation gives <math>1 / \log_{2}(2) = 1</math>, while the second gives <math>1 / \log_{2}(2+1) \approx 0.631</math>. The two formulations agree for binary judgments only in the numerator: in both, binary relevance produces a gain of 0 or 1 at rank <math>i</math>. No matter the number of relevance grades, the two formulations differ in how they discount that gain. Note that Croft et al. (2010) and Burges et al. (2005) present the second DCG with a log of base ''e'', while both versions of DCG above use a log of base 2. The base of the log affects the value of DCG in both formulations. When computing NDCG with the second formulation, the base does not matter, because every term is scaled by the same constant and that constant cancels in the ratio. With the first formulation the base does affect the value of NDCG, because the undiscounted first term <math>rel_{1}</math> is not scaled.
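The comparison above can be checked numerically. The following is a sketch only: the names <code>dcg_v1</code>, <code>dcg_v2</code> and <code>ndcg</code> are hypothetical, the ideal ordering is assumed to be the descending sort of the same list, and the graded list is an arbitrary illustration.

```python
import math
from typing import Callable, Sequence


def dcg_v1(rels: Sequence[float], base: float = 2.0) -> float:
    """First formulation: rel_1 + sum over i >= 2 of rel_i / log_base(i)."""
    return sum(rel if i == 1 else rel / math.log(i, base)
               for i, rel in enumerate(rels, start=1))


def dcg_v2(rels: Sequence[float], base: float = 2.0) -> float:
    """Second formulation: sum of (2^rel_i - 1) / log_base(i + 1)."""
    return sum((2 ** rel - 1) / math.log(i + 1, base)
               for i, rel in enumerate(rels, start=1))


def ndcg(dcg_fn: Callable[[Sequence[float], float], float],
         rels: Sequence[float], base: float) -> float:
    """Normalize by the DCG of the descending sort (assumed ideal ordering)."""
    return dcg_fn(rels, base) / dcg_fn(sorted(rels, reverse=True), base)


# Binary judgments with the single relevant document at rank 2.
binary = [0, 1]
print(dcg_v1(binary), dcg_v2(binary))

# Arbitrary graded judgments, used to compare log bases 2 and e.
graded = [3, 2, 3, 0, 1, 2]
for fn in (dcg_v1, dcg_v2):
    print(fn.__name__, ndcg(fn, graded, 2.0), ndcg(fn, graded, math.e))
```

In this sketch the normalized value of the second formulation is the same for both bases up to floating-point error, while the normalized value of the first formulation changes with the base.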
Wang et al. (2013)<ref>Yining Wang, Liwei Wang, Yuanzhi Li, Di He, Wei Chen, Tie-Yan Liu. 2013. A Theoretical Analysis of NDCG Ranking Measures. In Proceedings of the 26th Annual Conference on Learning Theory (COLT 2013).</ref> give a theoretical guarantee for using the logarithmic reduction factor in NDCG. Specifically, the authors prove that, for every pair of substantially different ranking functions, the ranking measure can decide which one is better in a consistent manner on almost all datasets.
=== Normalized DCG ===
Search result lists vary in length depending on the [[Web search query|query]]. Comparing a search engine's performance from one query to the next cannot be consistently achieved using DCG alone, so the cumulative gain at each position for a chosen value of <math>p</math> should be normalized across queries. This is done by sorting the documents of a result list by relevance, producing the maximum possible DCG through position <math>p</math>, also called the Ideal DCG (IDCG) through that position. For a query, the ''normalized discounted cumulative gain'', or nDCG, is computed as:
:<math> \mathrm{nDCG_{p}} = \frac{DCG_{p}}{IDCG_{p}} </math>
The nDCG values for all queries can be averaged to obtain a measure of the average performance of a search engine's ranking algorithm. Note that in a perfect ranking algorithm, the <math>DCG_p</math> will be the same as the <math>IDCG_p</math> producing an nDCG of 1.0. All nDCG calculations are then relative values on the interval 0.0 to 1.0 and so are cross-query comparable.
The main difficulty encountered in using nDCG is the unavailability of an ideal ordering of results when only partial [[relevance feedback]] is available.
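A minimal sketch of the normalization and the per-query averaging follows. The helper names are hypothetical. Two assumptions are made that the text does not fix: the ideal ordering is obtained by sorting the judged results of the same list, and a query with no relevant results is assigned an nDCG of 0.

```python
import math
from typing import Sequence


def dcg_at(rels: Sequence[float], p: int) -> float:
    """DCG_p with the first rank undiscounted and log2(i) discounts after it."""
    return sum(rel if i == 1 else rel / math.log2(i)
               for i, rel in enumerate(rels[:p], start=1))


def ndcg_at(rels: Sequence[float], p: int) -> float:
    """nDCG_p = DCG_p / IDCG_p, with IDCG_p taken from the sorted judgments."""
    idcg = dcg_at(sorted(rels, reverse=True), p)
    # Assumption: define nDCG as 0 when there is no gain to normalize by.
    return dcg_at(rels, p) / idcg if idcg > 0 else 0.0


def mean_ndcg(per_query_rels: Sequence[Sequence[float]], p: int) -> float:
    """Average nDCG_p over a collection of queries."""
    return sum(ndcg_at(rels, p) for rels in per_query_rels) / len(per_query_rels)


# Hypothetical judgments for three queries with result lists of different length.
queries = [[3, 2, 1], [1, 3, 2, 0], [0, 2]]
print([round(ndcg_at(q, 3), 3) for q in queries])
print(round(mean_ndcg(queries, 3), 3))
```

Because each per-query value lies between 0 and 1, the average is not dominated by queries that happen to return more results.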
== Example ==
Presented with a list of documents in response to a search query, an experiment participant is asked to judge the relevance of each document to the query. Each document is to be judged on a scale of 0-3 with 0 meaning irrelevant, 3 meaning completely relevant, and 1 and 2 meaning "somewhere in between". For the documents ordered by the ranking algorithm as
:<math> D_{1}, D_{2}, D_{3}, D_{4}, D_{5}, D_{6} </math>
the user provides the following relevance scores:
:<math> 3, 2, 3, 0, 1, 2 </math>
That is: document 1 has a relevance of 3, document 2 has a relevance of 2, etc. The Cumulative Gain of this search result listing is:
:<math> \mathrm{CG_{6}} = \sum_{i=1}^{6} rel_{i} = 3 + 2 + 3 + 0 + 1 + 2 = 11</math>
Changing the order of any two documents does not affect the CG measure. If <math>D_3</math> and <math>D_4</math> are switched, the CG remains the same, 11. DCG is used to emphasize highly relevant documents appearing early in the result list. Using the logarithmic scale for reduction, the DCG for each result in order is:
{| class="wikitable" border="1"
|-
! <math>i</math>
! <math>rel_{i}</math>
! <math>\log_{2}i</math>
! <math> \frac{rel_{i}}{\log_{2}i} </math>
|-
| 1
| 3
| 0
| N/A
|-
| 2
| 2
| 1
| 2
|-
| 3
| 3
| 1.585
| 1.893
|-
| 4
| 0
| 2.0
| 0
|-
| 5
| 1
| 2.322
| 0.431
|-
| 6
| 2
| 2.585
| 0.774
|}
So the <math>DCG_{6}</math> of this ranking is:
:<math> \mathrm{DCG_{6}} = rel_{1} + \sum_{i=2}^{6} \frac{rel_{i}}{\log_{2}i} = 3 + (2 + 1.893 + 0 + 0.431 + 0.774) = 8.10</math>
Now a switch of <math>D_3</math> and <math>D_4</math> results in a reduced DCG because a less relevant document is placed higher in the ranking; that is, a more relevant document is discounted more by being placed in a lower rank.
The performance on this query cannot be compared with the performance on another query in this form, since the other query may have more results, resulting in a larger overall DCG that is not necessarily better. In order to compare, the DCG values must be normalized.
To normalize DCG values, an ideal ordering for the given query is needed. For this example, that ordering would be the [[Monotonic|monotonically decreasing]] sort of the relevance judgments provided by the experiment participant, which is:
:<math> 3, 3, 2, 2, 1, 0 </math>
The DCG of this ideal ordering, or ''IDCG'', is then:
:<math> \mathrm{IDCG_{6}} = 3 + (3 + 1.262 + 1 + 0.431 + 0) = 8.69 </math>
And so the nDCG for this query is given as:
:<math> \mathrm{nDCG_{6}} = \frac{DCG_{6}}{IDCG_{6}} = \frac{8.10}{8.69} = 0.932 </math>
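The arithmetic of this example can be reproduced with a short script. This is a sketch under the definitions used above; the variable and function names are illustrative only.

```python
import math

# Relevance judgments for D1..D6, as given in the example.
relevances = [3, 2, 3, 0, 1, 2]


def dcg(rels):
    """rel_1 plus the log2-discounted relevance of ranks 2 and beyond."""
    return sum(rel if i == 1 else rel / math.log2(i)
               for i, rel in enumerate(rels, start=1))


cg_6 = sum(relevances)
dcg_6 = dcg(relevances)
idcg_6 = dcg(sorted(relevances, reverse=True))  # ideal order: 3, 3, 2, 2, 1, 0
ndcg_6 = dcg_6 / idcg_6

# Swap D3 and D4 to see the effect on DCG (CG is unchanged by the swap).
swapped = relevances.copy()
swapped[2], swapped[3] = swapped[3], swapped[2]

print(f"CG_6   = {cg_6}")
print(f"DCG_6  = {dcg_6:.2f}")
print(f"IDCG_6 = {idcg_6:.2f}")
print(f"nDCG_6 = {ndcg_6:.3f}")
print(f"DCG_6 after swapping D3 and D4 = {dcg(swapped):.2f}")
```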
== Limitations ==
# Normalized DCG does not penalize bad documents in the result. For example, if a query returns two result lists with scores <math> 1,1,1 </math> and <math> 1,1,1,0 </math> respectively, both would be considered equally good even though the latter contains a bad result. One way to address this limitation is to use <math>1 - 2^{rel_{i}}</math> in the numerator for scores that should be penalized and <math>2^{rel_{i}} - 1</math> for all others. For example, for the ranking judgments <math>\mathrm{Excellent}, \mathrm{Fair}, \mathrm{Bad}</math> one might use numerical scores <math>1,0,-1</math> instead of <math>2,1,0</math>.
# Normalized DCG does not penalize missing documents in the result. For example, if a query returns two result lists with scores <math> 1,1,1 </math> and <math> 1,1,1,1,1 </math> respectively, both would be considered equally good. One way to address this limitation is to enforce a fixed size for the result set and use minimum scores for the missing documents. In the previous example, we would use the scores <math> 1,1,1,0,0 </math> and <math> 1,1,1,1,1 </math> and quote nDCG as nDCG@5.
# Normalized DCG may not be suitable for measuring performance on queries that often have several equally good results. This is especially true when the metric is limited to only the first few results, as is done in practice. For example, for queries such as "restaurants", nDCG@1 would account for only the first result; hence, if one result set contains only 1 restaurant from the nearby area while the other contains 5, both would end up with the same score even though the latter is more comprehensive.
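The first two limitations can be illustrated with a small sketch. The function names are hypothetical, and two assumptions are made that the list above does not spell out: missing documents are padded with a minimum score of 0, and for the fixed-size variant the ideal ordering is taken from a separately known pool of judged documents for the query rather than from the returned list itself.

```python
import math
from typing import Sequence


def dcg(rels: Sequence[float]) -> float:
    """rel_1 plus the log2-discounted relevance of ranks 2 and beyond."""
    return sum(rel if i == 1 else rel / math.log2(i)
               for i, rel in enumerate(rels, start=1))


def ndcg_self(rels: Sequence[float]) -> float:
    """nDCG where the ideal ordering is the sorted returned list itself."""
    idcg = dcg(sorted(rels, reverse=True))
    return dcg(rels) / idcg if idcg > 0 else 0.0


def ndcg_at_k(rels: Sequence[float], judged_pool: Sequence[float], k: int) -> float:
    """nDCG@k with zero-padding and an ideal built from judged_pool (assumed known)."""
    padded = (list(rels) + [0] * k)[:k]
    ideal = (sorted(judged_pool, reverse=True) + [0] * k)[:k]
    idcg = dcg(ideal)
    return dcg(padded) / idcg if idcg > 0 else 0.0


# Limitation 1: a trailing bad result does not lower the score.
print(ndcg_self([1, 1, 1]), ndcg_self([1, 1, 1, 0]))

# Limitation 2: a shorter list scores the same as a longer, complete one...
print(ndcg_self([1, 1, 1]), ndcg_self([1, 1, 1, 1, 1]))

# ...unless both are evaluated at a fixed size against the same judged pool.
pool = [1, 1, 1, 1, 1]
print(ndcg_at_k([1, 1, 1], pool, 5), ndcg_at_k([1, 1, 1, 1, 1], pool, 5))
```

Under these assumptions the padded three-result list scores below 1 at nDCG@5, while the complete five-result list keeps a score of 1.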
== References ==
<references />
[[Category:Information retrieval|*]]

Revision as of 22:10, 5 December 2012
