[[common-terms]]
=== Divide and conquer

The terms in a query string can be divided into more important (low frequency)
and less important (high frequency) terms. Documents that match only the less
important terms are probably of very little interest. Really, we want
documents that match as many of the more important terms as possible.

The `match` query accepts a `cutoff_frequency` parameter, which allows it to
divide the terms in the query string into a low frequency group and a high
frequency group. The low frequency group (more important terms) forms the
bulk of the query, while the high frequency group (less important terms) is
used only for scoring, not for matching. By treating these two groups
differently, we can gain a real boost of speed on previously slow queries.

.Domain specific stopwords
*********************************************

One of the benefits of `cutoff_frequency` is that you get _domain specific_
stopwords for free. For instance, a website about movies may use the words
``movie'', ``color'', ``black'' and ``white'' so often that they could be
considered almost meaningless. With the `stop` token filter, these domain
specific terms would have to be added to the stopwords list manually. However,
because the `cutoff_frequency` looks at the actual frequency of terms in the
index, these words would be classified as _high frequency_ automatically.
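
For comparison, a minimal sketch of the manual approach: an analyzer that
lists the domain specific stopwords by hand. (The index name `movies` and the
analyzer name are made up for illustration.)

[source,json]
---------------------------------
PUT /movies
{
  "settings": {
    "analysis": {
      "analyzer": {
        "movie_analyzer": {
          "type": "standard",
          "stopwords": [ "movie", "color", "black", "white" ]
        }
      }
    }
  }
}
---------------------------------

Every stopword has to be added manually and kept up to date as the content of
the index changes, which is exactly the maintenance burden that
`cutoff_frequency` avoids.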
*********************************************

Take this query as an example:

[source,json]
---------------------------------
{
  "match": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01 <1>
    }
  }
}
---------------------------------
<1> Any term that occurs in more than 1% of documents is considered to be high
    frequency. The `cutoff_frequency` can be specified as a fraction (`0.01`)
    or as an absolute number (`5`).
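
As a sketch of the absolute form (the threshold of `5` is arbitrary), the same
query could treat any term that appears in more than five documents as high
frequency:

[source,json]
---------------------------------
{
  "match": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 5
    }
  }
}
---------------------------------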

This query uses the `cutoff_frequency` to first divide the query terms into a
low frequency group (`quick`, `dead`) and a high frequency group (`and`,
`the`). Then the query is rewritten to produce the following `bool` query:

[source,json]
---------------------------------
{
  "bool": {
    "must": { <1>
      "bool": {
        "should": [
          { "term": { "text": "quick" }},
          { "term": { "text": "dead" }}
        ]
      }
    },
    "should": { <2>
      "bool": {
        "should": [
          { "term": { "text": "and" }},
          { "term": { "text": "the" }}
        ]
      }
    }
  }
}
---------------------------------
<1> At least one low frequency / high importance term *must* match.
<2> High frequency / low importance terms are entirely optional.

The `must` clause means that at least one of the low frequency terms --
`quick` or `dead` -- *must* be present for a document to be considered a
match. All other documents are excluded. The `should` clause then looks for
the high frequency terms `and` and `the`, but only in the documents collected
by the `must` clause. The sole job of the `should` clause is to score a
document like ``Quick **AND THE** dead'' higher than ``**THE** quick but
dead''. This approach greatly reduces the number of documents that need to be
examined and scored.

.`and` query
********************************

Setting the `operator` parameter to `and` would simply make all low and high
frequency terms required. As we saw in <<stopwords-and>>, this is already an
efficient query.
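
A minimal sketch of that combination (the cutoff value is illustrative):

[source,json]
---------------------------------
{
  "match": {
    "text": {
      "query": "Quick and the dead",
      "operator": "and",
      "cutoff_frequency": 0.01
    }
  }
}
---------------------------------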
********************************

==== Controlling precision

The `minimum_should_match` parameter can be combined with `cutoff_frequency`,
but it applies only to the low frequency terms. This query:

[source,json]
---------------------------------
{
  "match": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01,
      "minimum_should_match": "75%"
    }
  }
}
---------------------------------

would be rewritten as:

[source,json]
---------------------------------
{
  "bool": {
    "must": {
      "bool": {
        "should": [
          { "term": { "text": "quick" }},
          { "term": { "text": "dead" }}
        ],
        "minimum_should_match": 1 <1>
      }
    },
    "should": { <2>
      "bool": {
        "should": [
          { "term": { "text": "and" }},
          { "term": { "text": "the" }}
        ]
      }
    }
  }
}
---------------------------------
<1> Because there are only two terms, the original 75% is rounded down
    to `1`, that is: ``1 out of 2 low frequency terms must match''.
<2> The high frequency terms are still optional and used only for scoring.

==== Only high frequency terms

An `or` query for high frequency terms only -- ``To be or not to be'' -- is
the worst case for performance. It is pointless to score *all* of the
documents that contain only one of these terms in order to return just the top
ten matches. We are really only interested in documents where all of the terms
occur together, so in the case where there are no low frequency terms, the
query is rewritten to make all high frequency terms required:

[source,json]
---------------------------------
{
  "bool": {
    "must": [
      { "term": { "text": "to" }},
      { "term": { "text": "be" }},
      { "term": { "text": "or" }},
      { "term": { "text": "not" }},
      { "term": { "text": "to" }},
      { "term": { "text": "be" }}
    ]
  }
}
---------------------------------
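
For reference, the query that would trigger this rewrite is just an ordinary
`match` query over the same phrase (a sketch; the cutoff value is
illustrative):

[source,json]
---------------------------------
{
  "match": {
    "text": {
      "query": "To be or not to be",
      "cutoff_frequency": 0.01
    }
  }
}
---------------------------------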

==== More control with `common` terms

While the high/low frequency functionality in the `match` query is useful,
sometimes you want more control over how the high and low frequency groups
should be handled. The `match` query just exposes a subset of the
functionality available in the `common` terms query.

For instance, we could make all of the low frequency terms required, along
with 75% of the high frequency terms, with a query like this:

[source,json]
---------------------------------
{
  "common": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01,
      "low_freq_operator": "and",
      "minimum_should_match": {
        "high_freq": "75%"
      }
    }
  }
}
---------------------------------
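
The `minimum_should_match` parameter also accepts a `low_freq` key, so the two
groups can be tuned independently. A sketch with illustrative values:

[source,json]
---------------------------------
{
  "common": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01,
      "minimum_should_match": {
        "low_freq": 2,
        "high_freq": "75%"
      }
    }
  }
}
---------------------------------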

See the {ref}query-dsl-common-terms-query.html[`common` terms query] reference
page for more options.