
Updated stemming to incorporate comments from Robert

Clinton Gormley, 11 years ago
parent commit ed891a3787

+ 4 - 2
230_Stemming.asciidoc

@@ -6,6 +6,8 @@ include::230_Stemming/20_Dictionary_stemmers.asciidoc[]
 
 include::230_Stemming/30_Hunspell_stemmer.asciidoc[]
 
-include::230_Stemming/40_Controlling_stemming.asciidoc[]
+include::230_Stemming/40_Choosing_a_stemmer.asciidoc[]
 
-include::230_Stemming/50_Stemming_in_situ.asciidoc[]
+include::230_Stemming/50_Controlling_stemming.asciidoc[]
+
+include::230_Stemming/60_Stemming_in_situ.asciidoc[]

+ 7 - 4
230_Stemming/00_Intro.asciidoc

@@ -62,13 +62,16 @@ them.
 
 Lemmatisation is a much more complicated and expensive process that needs to
 understand the context in which words appear in order to make decisions
-about what they mean. For now, stemmers are the best tools that we have
-available.
+about what they mean. In practice, stemming appears to be just as effective
+as lemmatisation, but with a much lower cost.
 
 **********************************************
 
-There are two types of stemmers available: algorithmic stemmers and dictionary
-stemmers.
+First we will discuss the two classes of stemmers available in Elasticsearch
+-- <<algorithmic-stemmers>> and <<dictionary-stemmers>> -- then look at how to
+choose the right stemmer for your needs in <<choosing-a-stemmer>>.  Finally,
+we will discuss options for tailoring stemming in <<controlling-stemming>> and
+<<stemming-in-situ>>.
 
 
 

+ 0 - 41
230_Stemming/10_Algorithmic_stemmers.asciidoc

@@ -148,44 +148,3 @@ PUT /my_index
     `light_english` stemmer.
 <2> Added the `asciifolding` token filter.
 
-==== Choosing an algorithmic stemmer
-
-The documentation for the
-{ref}analysis-stemmer-tokenfilter.html[`stemmer` token filter]
-lists multiple stemmers for some languages.  For Portuguese we have:
-
-* `portuguese`
-* `light_portuguese`
-* `minimal_portuguese`
-* `portuguese_rslp`
-
-For English we have:
-
-* `english`
-* `light_english`
-* `minimal_english`
-* `lovins`
-* `porter`
-* `porter2`
-* `possessive_english`
-
-One thing is for sure: whenever more than one solution exists for a problem,
-it means that none of the solutions solves the problem adequately. This
-certainly applies to stemming -- each stemmer is based on a different
-algorithm which overstems and understems words to a different degree.
-
-The {ref}analysis-stemmer-tokenfilter.html[`stemmer` token filter] reference
-documentation highlights the recommended choice for each language in bold,
-but the recommended stemmer may not be appropriate for all use cases. It is
-usually chosen because it offers a reasonable compromise between performance
-and accuracy.  You may find that, for your particular use case, the
-recommended stemmer is either too aggressive or not aggressive enough, in
-which case you may want to try a different stemmer.
-
-The `light_` stemmers are less aggressive than the standard stemmers, and the
-`minimal_` stemmers are less aggressive still. The Snowball-based stemmers
-tend to be slower than the other hand-coded stemmers, although that very much
-depends upon the implementation.
-
-Choosing the ``best'' stemmer is largely a case of trying each one out and
-selecting the one that seems to produce the best results for your documents.

+ 0 - 17
230_Stemming/30_Hunspell_stemmer.asciidoc

@@ -176,23 +176,6 @@ shards which use the same Hunspell analyzer share the same instance.
 
 ***********************************************
 
-==== When to use Hunspell
-
-In theory, the Hunspell stemmer promises accurate, configurable stemming.  The
-reality, sadly, falls short of the theory. The main problem is the difficulty
-of finding high quality, up to date dictionaries, with friendly licenses. Most
-dictionaries are incomplete and out of date.
-
-Hunspell tends to stem quite aggressively, reducing every word to the shortest
-form possible.  While this does increase recall, it also reduces precision. Of
-course, you can control the stemming process if you are willing to customize
-your own dictionary, but that requires a lot of effort and research.
-
-In practice, if a good algorithmic stemmer is available for your language, it
-makes more sense to use that rather than Hunspell.  It will be faster, consume
-less memory and the results will generally be as good or better than with
-Hunspell.
-
 [[hunspell-dictionary-format]]
 ==== Hunspell dictionary format
 

+ 118 - 0
230_Stemming/40_Choosing_a_stemmer.asciidoc

@@ -0,0 +1,118 @@
+:ref: http://foo.com/
+[[choosing-a-stemmer]]
+=== Choosing a stemmer
+
+The documentation for the
+{ref}analysis-stemmer-tokenfilter.html[`stemmer`] token filter
+lists multiple stemmers for some languages.  For English we have:
+
+[horizontal]
+`english`::
+    The {ref}analysis-porterstem-tokenfilter.html[`porter_stem`] token filter.
+
+`light_english`::
+    The {ref}analysis-kstem-tokenfilter.html[`kstem`] token filter.
+
+`minimal_english`::
+    The `EnglishMinimalStemmer` in Lucene, which removes plurals.
+
+`lovins`::
+    The {ref}analysis-snowball-tokenfilter.html[Snowball] based
+    http://snowball.tartarus.org/algorithms/lovins/stemmer.html[Lovins]
+    stemmer, the first stemmer ever produced.
+
+`porter`::
+    The {ref}analysis-snowball-tokenfilter.html[Snowball] based
+    http://snowball.tartarus.org/algorithms/porter/stemmer.html[Porter] stemmer.
+
+`porter2`::
+    The {ref}analysis-snowball-tokenfilter.html[Snowball] based
+    http://snowball.tartarus.org/algorithms/english/stemmer.html[Porter2] stemmer.
+
+`possessive_english`::
+    The `EnglishPossessiveFilter` in Lucene which removes `'s`.
+
+Add to that list the Hunspell stemmer with the various English dictionaries
+which are available.
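As a quick sketch of how one of these is wired in (the index, filter, and analyzer names here are illustrative, and the request syntax varies slightly across Elasticsearch versions), the chosen stemmer is passed as the `language` parameter of the `stemmer` token filter:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        }
      },
      "analyzer": {
        "my_english": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_english_stemmer" ]
        }
      }
    }
  }
}
```

Swapping `light_english` for `minimal_english`, `english`, or any other entry in the list above changes only the `language` value; the rest of the analysis chain stays the same.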
+
+One thing is for sure: whenever more than one solution exists for a problem,
+it means that none of the solutions solves the problem adequately. This
+certainly applies to stemming -- each stemmer uses a different approach which
+overstems and understems words to a different degree.
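One way to see these differences in practice is to run the same text through the `_analyze` API with different stemmers in the filter chain (a sketch; the exact `_analyze` request format depends on the Elasticsearch version):

```json
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "porter_stem" ],
  "text": "My sister is organizing the ponies"
}
```

Repeating the request with `kstem` in place of `porter_stem` shows the lighter behaviour: Porter typically reduces `ponies` to `poni`, while `kstem` leaves `pony`.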
+
+The `stemmer` documentation page highlights the ``recommended'' stemmer for
+each language in bold, usually because it offers a reasonable compromise
+between performance and quality. That said, the recommended stemmer may not be
+appropriate for all use cases. There is no single right answer to the question
+of which is the ``best'' stemmer -- it depends very much on your requirements.
+There are three factors to take into account when making a choice:
+performance, quality and degree:
+
+[[stemmer-performance]]
+==== Stemmer performance
+
+Algorithmic stemmers are typically four or five times faster than Hunspell
+stemmers. ``Hand crafted'' algorithmic stemmers are usually, but not always,
+faster than their Snowball equivalents.  For instance, the `porter_stem` token
+filter is significantly faster than the Snowball implementation of the Porter
+stemmer.
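The two Porter implementations mentioned above can be benchmarked side by side by defining one analyzer around each (the index, analyzer, and filter names here are illustrative):

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_snowball_porter": {
          "type": "stemmer",
          "language": "porter"
        }
      },
      "analyzer": {
        "fast_porter": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "porter_stem" ]
        },
        "snowball_porter": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_snowball_porter" ]
        }
      }
    }
  }
}
```

Feeding the same documents through each analyzer gives a rough, practical comparison of their relative speed on your own data.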
+
+Hunspell stemmers have to load all words, prefixes and suffixes into memory,
+which can consume a few megabytes of RAM.  Algorithmic stemmers, on the other
+hand, consist of a small amount of code and consume very little memory.
+
+[[stemmer-quality]]
+==== Stemmer quality
+
+All languages, except Esperanto, are irregular. While more formal words tend
+to follow a regular pattern, the most commonly used words often have
+irregular rules. Some stemming algorithms have been developed over years of
+research and produce reasonably high quality results. Others have been
+assembled more quickly with less research and deal only with the most common
+cases.
+
+While Hunspell offers the promise of dealing precisely with irregular words,
+it often falls short in practice. A dictionary stemmer is only as good as its
+dictionary. If Hunspell comes across a word which isn't in its dictionary, it
+can do nothing with it. Hunspell requires an extensive, high-quality,
+up-to-date dictionary in order to produce good results -- dictionaries of this
+calibre are few and far between. An algorithmic stemmer, on the other hand,
+will happily deal with new words that didn't exist when the designer created
+the algorithm.
+
+If a good algorithmic stemmer is available for your language, it makes sense
+to use it rather than Hunspell.  It will be faster, consume less memory and
+will generally be as good or better than the Hunspell equivalent.
+
+If accuracy and customizability are very important to you, and you need (and
+have the resources) to maintain a custom dictionary, then Hunspell gives you
+greater flexibility than the algorithmic stemmers. (See
+<<controlling-stemming>> for customization techniques which can be used with
+any stemmer.)
+
+[[stemmer-degree]]
+==== Stemmer degree
+
+Different stemmers overstem and understem to a different degree.  The `light_`
+stemmers stem less aggressively than the standard stemmers, and the `minimal_`
+stemmers less aggressively still.  Hunspell stems aggressively.
+
+Whether you want aggressive or light stemming depends on your use case.  If
+your search results are being consumed by a clustering algorithm, you may
+prefer to match more widely (and, thus, stem more aggressively).  If your
+search results are intended for human consumption, lighter stemming usually
+produces better results.  Stemming nouns and adjectives is more important for
+search than stemming verbs, but this also depends on the language.
+
+The other factor to take into account is the size of your document corpus.
+With a small corpus such as a catalog of 10,000 products, you probably want to
+stem more aggressively to ensure that you match at least some documents.  If
+your corpus is large, then it is likely you will get good matches with lighter
+stemming.
+
+==== Making a choice
+
+Start out with a recommended stemmer.  If it works well enough, then there is
+no need to change it.  If it doesn't, then you will need to spend some time
+investigating and comparing the stemmers available for your language in order
+to
+find the one that suits your purposes best.

230_Stemming/40_Controlling_stemming.asciidoc → 230_Stemming/50_Controlling_stemming.asciidoc


230_Stemming/50_Stemming_in_situ.asciidoc → 230_Stemming/60_Stemming_in_situ.asciidoc