Okapi BM25
In [[information retrieval]], '''Okapi BM25''' is a [[ranking function]] used by [[search engine]]s to rank matching documents according to their [[Relevance (information retrieval)|relevance]] to a given search query. It is based on the [[Probabilistic relevance model|probabilistic retrieval framework]] developed in the 1970s and 1980s by [[Stephen E. Robertson]], [[Karen Spärck Jones]], and others.
 
The name of the actual ranking function is BM25. To set the right context, however, it is usually referred to as "Okapi BM25", since the Okapi information retrieval system, implemented at [[London]]'s [[City University, London|City University]] in the 1980s and 1990s, was the first system to implement this function.
 
BM25 and its newer variants, e.g. BM25F (a version of BM25 that can take document structure and anchor text into account), represent state-of-the-art [[TF-IDF]]-like retrieval functions used in document retrieval tasks such as Web search.
 
== The ranking function ==
 
BM25 is a [[Bag of words model|bag-of-words]] retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document (e.g., their relative proximity). It is not a single function, but actually a whole family of scoring functions, with slightly different components and parameters. One of the most prominent instantiations of the function is as follows.
 
Given a query <math>Q</math>, containing keywords <math>q_1, ..., q_n</math>, the BM25 score of a document <math>D</math> is:
 
:<math> \text{score}(D,Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})},</math>
 
where <math>f(q_i, D)</math> is <math>q_i</math>'s [[term frequency]] in the document <math>D</math>, <math>|D|</math> is the length of the document <math>D</math> in words, and <math>\text{avgdl}</math> is the average document length in the text collection from which documents are drawn. <math>k_1</math> and <math>b</math> are free parameters, usually chosen, in the absence of an advanced optimization, as <math>k_1 \in [1.2,2.0]</math> and <math>b = 0.75</math>.<ref>Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. ''An Introduction to Information Retrieval'', Cambridge University Press, 2009, p. 233.</ref> <math>\text{IDF}(q_i)</math> is the IDF ([[inverse document frequency]]) weight of the query term <math>q_i</math>. It is usually computed as:
 
:<math>\text{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5},</math>
 
where <math>N</math> is the total number of documents in the collection, and <math>n(q_i)</math> is the number of documents containing <math>q_i</math>.
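
As an illustration, the scoring function above can be written out directly. The following Python sketch is self-contained and minimal; the toy corpus, the pre-tokenized input, and the default parameter values <math>k_1 = 1.5</math>, <math>b = 0.75</math> are illustrative assumptions rather than part of BM25 itself.

<syntaxhighlight lang="python">
import math
from collections import Counter

def bm25_score(query_terms, document, corpus, k1=1.5, b=0.75):
    """Score one document against a query with the BM25 formula above.

    `document` and each element of `corpus` are lists of tokens;
    k1 and b are the free parameters discussed in the text.
    """
    N = len(corpus)                                      # total number of documents
    avgdl = sum(len(d) for d in corpus) / N              # average document length
    tf = Counter(document)
    score = 0.0
    for q in query_terms:
        n_q = sum(1 for d in corpus if q in d)           # document frequency n(q_i)
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5))    # IDF(q_i); may be negative
        f = tf[q]                                        # term frequency f(q_i, D)
        denom = f + k1 * (1 - b + b * len(document) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score

# Tiny, purely illustrative corpus of pre-tokenized documents:
corpus = [["the", "quick", "brown", "fox"],
          ["quick", "brown", "dogs"],
          ["lazy", "dogs", "sleep"]]
print(bm25_score(["quick", "fox"], corpus[0], corpus))
</syntaxhighlight>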
 
There are several interpretations for IDF and slight variations on its formula. In the original BM25 derivation, the IDF component is derived from the [[Binary Independence Model]].
 
Note that the above formula for IDF has a potentially major drawback when it is applied to terms appearing in more than half of the documents in the corpus: such terms receive a negative IDF, so for two almost-identical documents, one of which contains the term and one of which does not, the latter may be ranked higher. In other words, terms appearing in more than half of the corpus contribute negatively to the final document score. This is often undesirable, so many real-world applications handle the IDF formula differently:
 
* Each summand can be given a floor of 0, effectively filtering out common terms (a sketch of the first two options follows this list);
* The IDF function can be given a floor of a small positive constant <math>\epsilon</math>, so that common terms are not ignored entirely;
* The IDF function can be replaced with a similarly shaped one that is non-negative, or strictly positive, so that no term is ignored entirely.
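
A minimal Python sketch of the first two options, using the same notation as above; the floor constant <math>\epsilon = 0.25</math> is an arbitrary illustrative choice, not a standard value.

<syntaxhighlight lang="python">
import math

def idf_raw(N, n_q):
    """The IDF formula above; negative once n_q exceeds roughly N / 2."""
    return math.log((N - n_q + 0.5) / (n_q + 0.5))

def idf_floored_at_zero(N, n_q):
    """First option: clamp each summand's IDF at 0, dropping very common terms."""
    return max(0.0, idf_raw(N, n_q))

def idf_floored_at_epsilon(N, n_q, eps=0.25):
    """Second option: clamp at a small positive constant so common terms still count."""
    return max(eps, idf_raw(N, n_q))

# A term occurring in 80 of 100 documents has a negative raw IDF:
print(idf_raw(100, 80))                 # about -1.37
print(idf_floored_at_zero(100, 80))     # 0.0
print(idf_floored_at_epsilon(100, 80))  # 0.25
</syntaxhighlight>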
 
== IDF information-theoretic interpretation ==
Here is an interpretation from information theory. Suppose a query term <math>q</math> appears in <math>n(q)</math> documents. Then a randomly picked document <math>D</math> will contain the term with probability <math>\frac{n(q)}{N}</math> (where <math>N</math> is again the cardinality of the set of documents in the collection). Therefore, the [[information]] content of the message "<math>D</math> contains <math>q</math>" is:
 
:<math>-\log \frac{n(q)}{N} = \log \frac{N}{n(q)}.</math>
 
Now suppose we have two query terms <math>q_1</math> and <math>q_2</math>. If the two terms occur in documents entirely independently of each other, then the probability of seeing both <math>q_1</math> and <math>q_2</math> in a randomly picked document <math>D</math> is:
 
:<math>\frac{n(q_1)}{N} \cdot \frac{n(q_2)}{N},</math>
 
and the information content of such an event is:
 
:<math>\sum_{i=1}^{2} \log \frac{N}{n(q_i)}.</math>
 
With a small variation, this is exactly what is expressed by the IDF component of BM25.
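
To make the "small variation" concrete, the following Python snippet compares the information-theoretic quantity <math>\log \frac{N}{n(q)}</math> with BM25's smoothed IDF for a few document frequencies; the collection size <math>N = 1000</math> and the chosen frequencies are purely illustrative. The two agree closely for rare terms and diverge (with the BM25 IDF turning negative) for very common ones.

<syntaxhighlight lang="python">
import math

N = 1000  # hypothetical collection size
for n_q in (1, 10, 100, 500, 900):
    info = math.log(N / n_q)                          # information content log(N / n(q))
    idf = math.log((N - n_q + 0.5) / (n_q + 0.5))     # BM25's smoothed IDF
    print(f"n(q) = {n_q:4d}   log(N/n(q)) = {info:6.3f}   BM25 IDF = {idf:6.3f}")
</syntaxhighlight>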
 
== Modifications ==
* At the extreme values of the coefficient <math>b</math>, BM25 turns into ranking functions known as '''BM11''' (for <math>b=1</math>) and '''BM15''' (for <math>b=0</math>).<ref>http://xapian.org/docs/bm25.html</ref>
* '''BM25F'''<ref>Hugo Zaragoza, Nick Craswell, Michael Taylor, Suchi Saria, and Stephen Robertson. [http://trec.nist.gov/pubs/trec13/papers/microsoft-cambridge.web.hard.pdf ''Microsoft Cambridge at TREC-13: Web and HARD tracks.''] In Proceedings of TREC-2004.</ref> is a modification of BM25 in which the document is considered to be composed of several fields (such as headlines, main text, anchor text) with possibly different degrees of importance; a much-simplified sketch of the field-weighting idea follows.
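
The field-weighting idea behind BM25F can be illustrated with a much-simplified Python sketch: per-field term frequencies are length-normalized, multiplied by per-field weights, and summed into a single pseudo-frequency, which is then passed through the usual BM25 saturation. The field names, weights, and the single pair of <math>k_1</math>, <math>b</math> parameters below are illustrative assumptions; the actual formulation (see the TREC-13 paper cited above) allows, among other things, a separate normalization parameter per field.

<syntaxhighlight lang="python">
import math

def bm25f_like_score(query_terms, doc_fields, corpus_fields,
                     field_weights, k1=1.5, b=0.75):
    """Score a fielded document with a much-simplified BM25F-style formula.

    `doc_fields` maps a field name (e.g. "title", "body") to that field's
    token list for one document; `corpus_fields` is a list of such dicts,
    one per document in the collection.
    """
    N = len(corpus_fields)
    # Average length of each field across the collection.
    avg_len = {f: sum(len(d.get(f, [])) for d in corpus_fields) / N
               for f in field_weights}
    score = 0.0
    for q in query_terms:
        # Weighted, length-normalized pseudo term frequency over all fields.
        pseudo_tf = 0.0
        for f, w in field_weights.items():
            tokens = doc_fields.get(f, [])
            norm = (1 - b + b * len(tokens) / avg_len[f]) if avg_len[f] else 1.0
            pseudo_tf += w * tokens.count(q) / norm
        # Document frequency counts a document if the term occurs in any field.
        n_q = sum(1 for d in corpus_fields
                  if any(q in d.get(f, []) for f in field_weights))
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5))
        score += idf * pseudo_tf / (k1 + pseudo_tf)
    return score
</syntaxhighlight>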
 
== Footnotes ==
{{Reflist}}
 
== References ==
* Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC 1994). Gaithersburg, USA, November 1994.
 
* Stephen E. Robertson, Steve Walker, and Micheline Hancock-Beaulieu. Okapi at TREC-7. In Proceedings of the Seventh Text REtrieval Conference. Gaithersburg, USA, November 1998.
 
* Karen Spärck Jones, Steve Walker, and Stephen E. Robertson. A Probabilistic Model of Information Retrieval: Development and Comparative Experiments (parts 1 and 2). Information Processing and Management, 36(6):779-840. 2000.
 
== External links ==
* [http://www.soi.city.ac.uk/~ser/papers/foundations_bm25_review.pdf The Probabilistic Relevance Framework: BM25 and Beyond]
* [http://nlp.uned.es/~jperezi/Lucene-BM25/ BM25 implementation for Lucene]
 
[[Category:Ranking functions]]
[[Category:Information retrieval]]
