=== Fetch Phase

The query phase identifies which documents satisfy((("distributed search execution", "fetch phase")))((("fetch phase of distributed search"))) the search request, but we
still need to retrieve the documents themselves. This is the job of the fetch
phase, shown in <<img-distrib-fetch>>.

[[img-distrib-fetch]]
.Fetch phase of distributed search
image::images/elas_0902.png["Fetch Phase of distributed search"]

The fetch phase consists of the following steps:

1. The coordinating node identifies which documents need to be fetched and
   issues a multi `GET` request to the relevant shards.

2. Each shard loads the documents and _enriches_ them, if required, and then
   returns the documents to the coordinating node.

3. Once all documents have been fetched, the coordinating node returns the
   results to the client.

The coordinating node first decides which documents _actually_ need to be
fetched. For instance, if our query specified `{ "from": 90, "size": 10 }`,
the first 90 results would be discarded and only the next 10 results would
need to be retrieved. These documents may come from one, some, or all of the
shards involved in the original search request.
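
Such a paged request might look like the following sketch, where the query
body itself stands in for whatever search you are actually running:

[source,js]
--------------------------------------------------
GET /_search
{
    "from": 90,
    "size": 10,
    "query": {
        "match_all": {}
    }
}
--------------------------------------------------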

The coordinating node builds a <<distrib-multi-doc,multi-get request>> for
each shard that holds a pertinent document and sends the request to the same
shard copy that handled the query phase.
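
This shard-level multi-get happens internally, but it is conceptually similar
to the client-facing `_mget` API. The index, type, and IDs below are purely
illustrative:

[source,js]
--------------------------------------------------
GET /_mget
{
   "docs" : [
      { "_index" : "my_index", "_type" : "my_type", "_id" : "1" },
      { "_index" : "my_index", "_type" : "my_type", "_id" : "2" }
   ]
}
--------------------------------------------------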

The shard loads the document bodies--the `_source` field--and, if
requested, enriches the results with metadata and
<<highlighting-intro,search snippet highlighting>>.

Once the coordinating node receives all results, it assembles them into a
single response that it returns to the client.

.Deep Pagination
****
The query-then-fetch process supports pagination with the `from` and `size`
parameters, but _within limits_. ((("size parameter")))((("from parameter")))((("pagination", "supported by query-then-fetch process")))((("deep paging, problems with"))) Remember that each shard must build a priority
queue of length `from + size`, all of which need to be passed back to
the coordinating node. And the coordinating node needs to sort through
`number_of_shards * (from + size)` documents in order to find the correct
`size` documents. With five shards and `{ "from": 10000, "size": 10 }`, for
example, each shard builds a priority queue of 10,010 entries, and the
coordinating node sorts through 50,050 results just to return 10 of them.

Depending on the size of your documents, the number of shards, and the
hardware you are using, paging 10,000 to 50,000 results (1,000 to 5,000 pages)
deep should be perfectly doable. But with big-enough `from` values, the
sorting process can become very heavy indeed, using vast amounts of CPU,
memory, and bandwidth. For this reason, we strongly advise against deep paging.

In practice, ``deep pagers'' are seldom human anyway. A human will stop
paging after two or three pages and will change the search criteria. The
culprits are usually bots or web spiders that tirelessly keep fetching page
after page until your servers crumble at the knees.

If you _do_ need to fetch large numbers of docs from your cluster, you can
do so efficiently by disabling sorting with the `scroll` query,
which we discuss <<scan-scroll,later in this chapter>>.
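
As a taste of what is to come, a scan-and-scroll request takes roughly this
shape--the index name and sizes here are illustrative, and the details are
left to <<scan-scroll>>:

[source,js]
--------------------------------------------------
GET /my_index/_search?search_type=scan&scroll=1m
{
    "query": { "match_all": {} },
    "size":  1000
}
--------------------------------------------------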
****