[[common-terms]]
=== Divide and conquer

The terms in a query string can be divided into more important (low frequency)
and less important (high frequency) terms. Documents that match only the less
important terms are probably of very little interest. Really, we want
documents that match as many of the more important terms as possible.

The `match` query accepts a `cutoff_frequency` parameter, which allows it to
divide the terms in the query string into a low frequency group and a high
frequency group. The low frequency group (more important terms) forms the
bulk of the query, while the high frequency group (less important terms) is
used only for scoring, not for matching. By treating these two groups
differently, we can gain a real boost of speed on previously slow queries.

.Domain specific stopwords
*********************************************

One of the benefits of `cutoff_frequency` is that you get _domain specific_
stopwords for free. For instance, a website about movies may use the words
``movie'', ``color'', ``black'' and ``white'' so often that they could be
considered almost meaningless. With the `stop` token filter, these domain
specific terms would have to be added to the stopwords list manually. However,
because the `cutoff_frequency` looks at the actual frequency of terms in the
index, these words would be classified as _high frequency_ automatically.
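
For comparison, a minimal sketch of the manual approach: an analyzer that
lists the domain specific stopwords by hand. (The index name `movies` and the
analyzer name are made up for illustration.)

[source,json]
---------------------------------
PUT /movies
{
  "settings": {
    "analysis": {
      "analyzer": {
        "movie_analyzer": {
          "type": "standard",
          "stopwords": [ "movie", "color", "black", "white" ]
        }
      }
    }
  }
}
---------------------------------

Every stopword has to be added manually and kept up to date as the content of
the index changes, which is exactly the maintenance burden that
`cutoff_frequency` avoids.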
*********************************************

Take this query as an example:

[source,json]
---------------------------------
{
  "match": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01 <1>
    }
  }
}
---------------------------------
<1> Any term that occurs in more than 1% of documents is considered to be high
    frequency. The `cutoff_frequency` can be specified as a fraction (`0.01`)
    or as an absolute number (`5`).
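
As a sketch of the absolute form (the threshold of `5` is arbitrary), the same
query could treat any term that appears in more than five documents as high
frequency:

[source,json]
---------------------------------
{
  "match": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 5
    }
  }
}
---------------------------------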

This query uses the `cutoff_frequency` to first divide the query terms into a
low frequency group (`quick`, `dead`) and a high frequency group (`and`,
`the`). Then the query is rewritten to produce the following `bool` query:

[source,json]
---------------------------------
{
  "bool": {
    "must": { <1>
      "bool": {
        "should": [
          { "term": { "text": "quick" }},
          { "term": { "text": "dead" }}
        ]
      }
    },
    "should": { <2>
      "bool": {
        "should": [
          { "term": { "text": "and" }},
          { "term": { "text": "the" }}
        ]
      }
    }
  }
}
---------------------------------
<1> At least one low frequency / high importance term *must* match.
<2> High frequency / low importance terms are entirely optional.

The `must` clause means that at least one of the low frequency terms --
`quick` or `dead` -- *must* be present for a document to be considered a
match. All other documents are excluded. The `should` clause then looks for
the high frequency terms `and` and `the`, but only in the documents collected
by the `must` clause. The sole job of the `should` clause is to score a
document like ``Quick **AND THE** dead'' higher than ``**THE** quick but
dead''. This approach greatly reduces the number of documents that need to be
examined and scored.

.`and` query
********************************

Setting the `operator` parameter to `and` would simply make all low and high
frequency terms required. As we saw in <<stopwords-and>>, this is already an
efficient query.
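
A minimal sketch of that combination (the cutoff value is illustrative):

[source,json]
---------------------------------
{
  "match": {
    "text": {
      "query": "Quick and the dead",
      "operator": "and",
      "cutoff_frequency": 0.01
    }
  }
}
---------------------------------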
********************************

==== Controlling precision

The `minimum_should_match` parameter can be combined with `cutoff_frequency`,
but it applies only to the low frequency terms. This query:

[source,json]
---------------------------------
{
  "match": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01,
      "minimum_should_match": "75%"
    }
  }
}
---------------------------------

would be rewritten as:

[source,json]
---------------------------------
{
  "bool": {
    "must": {
      "bool": {
        "should": [
          { "term": { "text": "quick" }},
          { "term": { "text": "dead" }}
        ],
        "minimum_should_match": 1 <1>
      }
    },
    "should": { <2>
      "bool": {
        "should": [
          { "term": { "text": "and" }},
          { "term": { "text": "the" }}
        ]
      }
    }
  }
}
---------------------------------
<1> Because there are only two terms, the original 75% is rounded down
    to `1`, that is: ``1 out of 2 low frequency terms must match''.
<2> The high frequency terms are still optional and used only for scoring.

==== Only high frequency terms

An `or` query for high frequency terms only -- ``To be or not to be'' -- is
the worst case for performance. It is pointless to score *all* of the
documents that contain only one of these terms in order to return just the top
ten matches. We are really only interested in documents where all of the terms
occur together, so in the case where there are no low frequency terms, the
query is rewritten to make all high frequency terms required:

[source,json]
---------------------------------
{
  "bool": {
    "must": [
      { "term": { "text": "to" }},
      { "term": { "text": "be" }},
      { "term": { "text": "or" }},
      { "term": { "text": "not" }},
      { "term": { "text": "to" }},
      { "term": { "text": "be" }}
    ]
  }
}
---------------------------------
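
For reference, the query that would trigger this rewrite is just an ordinary
`match` query over the same phrase (a sketch; the cutoff value is
illustrative):

[source,json]
---------------------------------
{
  "match": {
    "text": {
      "query": "To be or not to be",
      "cutoff_frequency": 0.01
    }
  }
}
---------------------------------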

==== More control with `common` terms

While the high/low frequency functionality in the `match` query is useful,
sometimes you want more control over how the high and low frequency groups
should be handled. The `match` query just exposes a subset of the
functionality available in the `common` terms query.

For instance, we could make all of the low frequency terms required, along
with 75% of the high frequency terms, with a query like this:

[source,json]
---------------------------------
{
  "common": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01,
      "low_freq_operator": "and",
      "minimum_should_match": {
        "high_freq": "75%"
      }
    }
  }
}
---------------------------------
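
The `minimum_should_match` parameter also accepts a `low_freq` key, so the two
groups can be tuned independently. A sketch with illustrative values:

[source,json]
---------------------------------
{
  "common": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01,
      "minimum_should_match": {
        "low_freq": 2,
        "high_freq": "75%"
      }
    }
  }
}
---------------------------------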

See the {ref}query-dsl-common-terms-query.html[`common` terms query] reference
page for more options.